NB: This document filters Ontario FSALDU rows from the raw DemoStats2020 data

pacman::p_load(psych, xray, DT, excelR,
               sjmisc, sjlabelled, sjstats, sjPlot, dplyr,
               knitr, kableExtra, captioner)

1 Statement of Authorship

This document has been prepared by our consulting team (listed below) without the help of any other person: Himanshu Kamra, Fowad Ijaz, Jaspreet Gulati, and Gabriella Paniccia.

2 Executive Summary

Using K-means clustering and a complementary GLM logistic regression model, four segments were identified as key targets for home and property insurance partnership purposes:

Segment 1: Open-Minded Renters | These individuals (26-64) are the most tech-enthusiastic of all renters, though their smart-tech adoption is only medium: they own smartphones, smart thermostats, and smart home assistants. Although they are excited about technology, they are not likely to purchase smart home technologies in the future, perhaps because there is not enough incentive for them to do so (e.g. lowering insurance premiums, simplifying their lives, etc.). There is also a large opportunity to educate these individuals on the ease of use of these technologies, as they are middle-aged and live busy lives, most with one to two children in an apartment. While they don’t plan on owning smart home technology, they are considered open-minded individuals: they are the segment most willing to share their mobile information, are enthusiastic about technology in general, trust banking/financial apps, and are therefore likely to convert. Overall, they are technologically enabled, but need an extra push to purchase more smart home technology. Location: scattered across Ontario

Segment 2: Low-tech Homeowners | This segment owns homes, falls in the 40-64 age bracket, and is not only slow to adopt smart home technology but does not plan on purchasing it in the future. They live in 3-4 person households, and given that they pay medium-to-high home insurance premiums, there is an opportunity to educate them on the impact that smart home technology could have on those premiums. They are not tech enthusiasts, but they currently own smartphones and smart home assistants. Interestingly, they are the most likely to consent to mobile information sharing. Location: Northern and Southern Ontario

Segment 3: Migrant Technophiles | These individuals are very excited by technology. They own the most smart technologies, including smartphones, smart thermostats, smart home security systems, smart home assistants, and smart home appliances and lighting. They are also likely to purchase more smart home technology in the future (specifically smart home assistants). Most have two children per household, are employed, and hold a university degree as their highest level of education (the highest among the four segments). Notably, they have the highest combined household income, roughly $148,000. They are also the most confident in big businesses as well as banking/financial applications, which is a plus for RBCI. Location: Southern Ontario

Segment 4: Relaxed Retirees | This segment is 65+ and pays high insurance premiums, but is slow to adopt smart home technology. Most have a smart thermostat and a smart home assistant, but don’t plan on purchasing more smart home technology in the future. These individuals (or their children) need considerable education in order to become more technologically enabled. Because the relaxed retirees pay the highest insurance premiums, messaging should emphasize opportunities to decrease premiums through the adoption of certain smart home technologies. They are also confident in big businesses, which is a plus for RBCI. Location: in or surrounding cities (non-rural)

These segments will be reached using specific messaging and marketing channels, which are detailed further in the document.

Partnership opportunities that are detailed towards the end of the document include: Google Nest, Ring, Amazon Echo, and General Electric.

3 Background

People don’t wake up thinking “I need to go buy insurance”. As such, RBC Insurance (RBCI) has tasked us with investigating where insurance can be logically placed to contribute to the client value proposition. Smart devices and home automation are growing industries that are very important to the insurance business as a mechanism to engage and educate consumers, prevent loss, and provide data and experience to help price risk.

With new entrants delivering benefits to our clients in many ways, some smart and connected devices provide a logical connection with the insurance value proposition. To assess a potential partnership ecosystem around smart home automation, RBCI wants to understand which market segments are interested in smart technology, the type of technology they are interested in and where they are located across Canada.

With that being said, the project scope, or business opportunity, has been narrowed down to the following: By leveraging smart home technology, who in Ontario should RBC Insurance target for its home & property insurance solutions?

Environics Analytics has kindly shared multiple datasets with our team, comprising 6,375 variables. We have narrowed the scope of the project to home and property insurance, shortening the variable list to the following 32 variables:

Technology Variables

  • Technology Enthusiast - Are you enthusiastic about technology?
  • Mobile Device Marketing Consent - Do you consent to mobile marketing?
  • Mobile Information Sharing Consent - Do you often consent to sharing your mobile information?
  • Smartphone Purchase People - Do you make purchases on your smart phone?
  • Smart Thermostat Owners - Do you currently own a smart thermostat?
  • Smart Home Security Owners - Do you currently have smart home security?
  • Smart Home Assistant Owners - Do you currently own a smart home assistant?
  • Smart Home Devices Owners - Do you currently own smart home devices?
  • Smart Thermostat Owners (Planning) - Do you plan to purchase a smart thermostat?
  • Smart Home Security Owners (Planning) - Do you plan to purchase smart home security systems/smartlocks?
  • Smart Home Assistant Owners (Planning) - Do you plan to purchase smart home assistants (e.g. Google Home, Amazon Echo, etc)?
  • Smart Home Devices Owners (Planning) - Do you plan to purchase smart home devices, appliances, or lighting?
  • Financial App Trust - Do you trust Banking/Financial Apps?

Socio-Demographic Variables

  • Total Population Median Age
  • Male Population Median Age
  • Female Population Median Age
  • Average Number Of Persons In Private Households
  • Total Family Households
  • Non-Family Households
  • Average Children Per Census Family Household
  • Houses
  • Apartment, Building Low And High Rise
  • Condos
  • Median Household Income
  • High School Certificate Or Equivalent
  • College, CEGEP Or Other Non-University Certificate Or Diploma
  • University Certificate Or Diploma Below Bachelor
  • Employed
  • Unemployed
  • Total Immigrant

Insurance Variables

  • Annual Premiums - Home Insurance - What are your annual premiums for home insurance?

Business Variables

  • Big Business Confidence - Are you confident in big businesses?

4 Objective

The objective of this project is to use the Environics Analytics data to perform a segmentation analysis and to ultimately build segment profiles and recommend smart technology partnership opportunities for RBCI. The locations and marketing strategies for these segments will also be identified.

5 Methods

  1. The initial Environics Analytics datasets were analyzed to identify 33 relevant variables for the project. These variables were then combined into one CSV file, “Project_Cleaned_File.csv”. Some variables needed to be divided by “Total Population” or “Total Household”.
  2. The combined dataset of 33 pre-selected variables was read into R Markdown. The data were cleaned, and the variables were re-ordered and renamed.
  3. A segmentation analysis was then performed using K-means clustering:
  • Level 1 segmentation was performed using attitude variables (predominantly technology-related variables).
  • ML models (GBM, GLM, and RF) were fit using h2o and their quality was assessed.
  • Level 2 segmentation was performed using demographic variables.
  • ML models (GBM, GLM, Deep Learning, and RF) were fit using h2o and their quality was assessed.
  4. Segment personas were created.
  5. Visualizations of the selected model were prepared.
  6. Given the results of the above analyses, a final recommendation was provided.
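The K-means step above can be sketched in miniature. This is an illustrative stand-in using base R’s kmeans() on synthetic data — not the h2o run that produced the report’s results, and the attitude variables here are made up for the example:

```r
# Minimal K-means sketch on synthetic attitude variables (hypothetical
# data, not the Environics fields used in the actual analysis).
set.seed(42)
toy <- data.frame(
  tech_enthusiasm = runif(200),  # hypothetical attitude variable
  fin_app_trust   = runif(200)   # hypothetical attitude variable
)

# Level 1 idea: k-means on standardized attitude variables
km <- kmeans(scale(toy), centers = 4, nstart = 25)
toy$segment <- factor(km$cluster)

table(toy$segment)  # segment sizes
```

The same pattern applies at Level 2, with demographic variables replacing the attitude variables.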

6 Analysis

library(h2o)
h2o.init()

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\jaspr\AppData\Local\Temp\RtmpSabFlD\file2388650737da/h2o_jaspr_started_from_r.out
    C:\Users\jaspr\AppData\Local\Temp\RtmpSabFlD\file2388129c246f/h2o_jaspr_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 339 milliseconds 
    H2O cluster timezone:       America/Toronto 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.32.0.1 
    H2O cluster version age:    2 months and 9 days  
    H2O cluster name:           H2O_started_from_R_jaspr_cyu991 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.95 GB 
    H2O cluster total cores:    12 
    H2O cluster allowed cores:  12 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.3 (2020-10-10) 

6.1 Reading the data provided by Environics Analytics (EA).

We will use the h2o package to read the data provided by EA.

6.2 Finding the column numbers of the variables identified as basis variables

Since there are many variables in each data sheet, we need to select only the required variables. In the next steps, we extract the required variables from each data file for further operations.

6.2.1 Social Values

df1.hex <- h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/SocialValues_2020_fsaldu_06nov20.csv")

which( h2o.colnames(df1.hex ) == "SV00015") # Big Business Confidence
[1] 18
which( h2o.colnames(df1.hex ) == "SV00028") # Technology Enthusiast
[1] 31
which( h2o.colnames(df1.hex ) == "CODE")    # postal code identifier
[1] 1

The code-chunk below extracts the high-potential basis variables into a smaller file.

df1s.hex<- as.h2o(df1.hex[, c(1,18,31)]) # copy selected columns to new dataframe
names(df1s.hex)
[1] "CODE"    "SV00015" "SV00028"

Since these files are so large, purge them from active duty after extracting the key variables.

rm(df1.hex) # remove objects you won't use again to free up space

6.2.2 DemoStats

Just repeat the same sequence of operations on each of the remaining data files.

df2.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/DemoStats2020_fsaldu_05nov20.csv")

which( h2o.colnames(df2.hex ) == "CODE")
[1] 1
which( h2o.colnames(df2.hex ) == "ECYBASPOP")
[1] 5
which( h2o.colnames(df2.hex ) == "ECYPTAMED") # Total Population Median Age
[1] 42
which( h2o.colnames(df2.hex ) == "ECYPMAMED")
[1] 63
which( h2o.colnames(df2.hex ) == "ECYPFAMED")
[1] 84
which( h2o.colnames(df2.hex ) == "ECYHSZAVG")
[1] 165
which( h2o.colnames(df2.hex ) == "ECYHTYFHT")
[1] 167
which( h2o.colnames(df2.hex ) == "ECYHTYNFH")
[1] 170
which( h2o.colnames(df2.hex ) == "ECYCHAFHCH")
[1] 255
which( h2o.colnames(df2.hex ) == "ECYSTYHOUS")
[1] 274
which( h2o.colnames(df2.hex ) == "ECYSTYAPT")
[1] 278
which( h2o.colnames(df2.hex ) == "ECYCDOCO")
[1] 286
which( h2o.colnames(df2.hex ) == "ECYHNIMED")
[1] 323
which( h2o.colnames(df2.hex ) == "ECYEDUHSCE")
[1] 331
which( h2o.colnames(df2.hex ) == "ECYEDUCOLL")
[1] 333
which( h2o.colnames(df2.hex ) == "ECYEDUUD")
[1] 335
which( h2o.colnames(df2.hex ) == "ECYACTEMP")
[1] 349
which( h2o.colnames(df2.hex ) == "ECYACTUEMP")
[1] 350
which( h2o.colnames(df2.hex ) == "ECYTIMIMGT")
[1] 540
df2s.hex<- as.h2o(df2.hex[, c(1,5,42,63,84,165,167,170,255,274,278,286,323,331,333,335,349,350,540   )]) # copy selected columns to new dataframe
names(df2s.hex)
 [1] "CODE"       "ECYBASPOP"  "ECYPTAMED"  "ECYPMAMED"  "ECYPFAMED" 
 [6] "ECYHSZAVG"  "ECYHTYFHT"  "ECYHTYNFH"  "ECYCHAFHCH" "ECYSTYHOUS"
[11] "ECYSTYAPT"  "ECYCDOCO"   "ECYHNIMED"  "ECYEDUHSCE" "ECYEDUCOLL"
[16] "ECYEDUUD"   "ECYACTEMP"  "ECYACTUEMP" "ECYTIMIMGT"
rm(df2.hex) # remove objects you won't use again to free up space

6.2.3 Money Matters

df3.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/MoneyMatters5_2020_fsaldu_06nov20.csv")

which( h2o.colnames(df3.hex ) == "CFM0876I")
[1] 163
which( h2o.colnames(df3.hex ) == "CODE") # postal code identifier
[1] 1
df3s.hex<- as.h2o(df3.hex[, c(163,1   )]) # copy selected columns to new dataframe
names(df3s.hex)
[1] "CFM0876I" "CODE"    
rm(df3.hex) # remove objects you won't use again to free up space

6.2.4 Mobile 1

df4.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/Mobile1_2020_fsaldu_06nov20.csv")

which( h2o.colnames(df4.hex ) == "M221C45")
[1] 41
which( h2o.colnames(df4.hex ) == "M222C45")
[1] 42
which( h2o.colnames(df4.hex ) == "M3F3C3") # Smartphone Purchase People
[1] 87
which( h2o.colnames(df4.hex ) == "CODE")
[1] 1
df4s.hex<- as.h2o(df4.hex[, c(1,41,42,87)]) # copy selected columns to new dataframe
names(df4s.hex)
[1] "CODE"    "M221C45" "M222C45" "M3F3C3" 
rm(df4.hex) # remove objects you won't use again to free up space

6.2.5 Mobile 2

df5.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/Mobile2_2020_fsaldu_06nov20.csv")

which( h2o.colnames(df5.hex ) == "M2X33C31")
[1] 20
which( h2o.colnames(df5.hex ) == "M2X32C32")
[1] 21
which( h2o.colnames(df5.hex ) == "M2X33C33")
[1] 22
which( h2o.colnames(df5.hex ) == "M2X33C34")
[1] 23
which( h2o.colnames(df5.hex ) == "M2Y33C31")
[1] 47
which( h2o.colnames(df5.hex ) == "M2Y32C32")
[1] 48
which( h2o.colnames(df5.hex ) == "M2Y33C33")
[1] 49
which( h2o.colnames(df5.hex ) == "M2Y33C34")
[1] 50
which( h2o.colnames(df5.hex ) == "CODE") # postal code identifier
[1] 1
df5s.hex<- as.h2o(df5.hex[, c(1,20,21,22,23,47,48,49,50)]) # copy selected columns to new dataframe
names(df5s.hex)
[1] "CODE"     "M2X33C31" "M2X32C32" "M2X33C33" "M2X33C34" "M2Y33C31" "M2Y32C32"
[8] "M2Y33C33" "M2Y33C34"
rm(df5.hex) # remove objects you won't use again to free up space

6.2.6 Mobile 6

df6.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/Mobile6_2020_fsaldu_06nov20.csv")

which( h2o.colnames(df6.hex ) == "M3Q241C45") # Financial App Trust
[1] 79
which( h2o.colnames(df6.hex ) == "CODE")      # postal code identifier
[1] 1
df6s.hex<- as.h2o(df6.hex[, c(1,79)]) # copy selected columns to new dataframe
h2o.names(df6s.hex)
[1] "CODE"      "M3Q241C45"
rm(df6.hex) # remove objects you won't use again to free up space

6.3 Combining the variables selected from the six datasets

Now the six data frames are merged on CODE to form one data frame.

dim(df1s.hex); dim(df2s.hex); dim(df3s.hex); dim(df4s.hex); dim(df5s.hex);   dim(df6s.hex)
[1] 285374      3
[1] 285374     19
[1] 285374      2
[1] 285374      4
[1] 285374      9
[1] 285374      2
df.hex<-merge( df1s.hex, df2s.hex, by= 'CODE') # BINDING TOGETHER THE BASIS VARIABLE DATA FRAMES
df.hex1<- merge(df.hex, df3s.hex, by='CODE')
df.hex2<- merge(df.hex1, df4s.hex, by='CODE')
df.hex3<- merge(df.hex2, df5s.hex, by='CODE')
df.hex4<- merge(df.hex3, df6s.hex, by='CODE')
rm( "df1s.hex", "df2s.hex", "df3s.hex", "df4s.hex", "df5s.hex" , "df6s.hex", "df.hex","df.hex1", "df.hex2", "df.hex3")

6.4 Renaming and Reordering of Variables

Since the original variable names were not descriptive and the full variable descriptions are long, we renamed them to provide context in our analyses. For example, as shown in the code below, “Big Business Confidence” becomes “Confidence”.

  • SV00015 : Big Business Confidence –> Confidence
  • SV00028 : Technology Enthusiast –> Tech Enthu.
  • ECYPTAMED : Total Population Median Age –> Pop_Age
  • ECYPMAMED : Male Population Median Age –> M_Pop_Age
  • ECYPFAMED : Female Population Median Age –> F_Pop_Age
  • ECYHSZAVG : Average Number Of Persons In Private Households –> **#_Pvt_HHld**
  • ECYHTYFHT : Total Family Households –> **#_Fam_Hhld**
  • ECYHTYNFH : Non-Family Households –> **#_NF_Hhld**
  • ECYCHAFHCH : Average Children Per Census Family Household –> Avg_Chld_PrCen_Fm_Hhld
  • ECYSTYHOUS : Houses –> Houses
  • ECYSTYAPT : Apartment, Building Low And High Rise –> Apt
  • ECYCDOCO : Condos –> Condos
  • ECYHNIMED : Median Household Income –> Hhld Income
  • ECYEDUHSCE : High School Certificate Or Equivalent –> High School
  • ECYEDUCOLL : College, CEGEP Or Other Non-University Certificate Or Diploma –> Clg_CEGEP_Non_Uni_Dip
  • ECYEDUUD : University Certificate Or Diploma Below Bachelor –> Uni_Dip_Blw_Bach
  • ECYACTEMP : Employed –> Empl
  • ECYACTUEMP : Unemployed –> UnEmp
  • ECYTIMIMGT : Total Immigrant –> Tot_Mig
  • CFM0876I : Annual Premiums - Home Insurance –> Ann_Prem_Hm_Insu
  • M221C45 : Mobile Device Marketing Consent –> Mob_Mark_cons
  • M222C45 : Mobile Information Sharing Consent –> Mob_Inf_sec_cons
  • M3F3C3 : Smartphone Purchase People –> #Smart_purch
  • M2X33C31 : Smart Thermostat Owners –> #Smt_Therm
  • M2X32C32 : Smart Home Security Owners –> #Smt_Hm_Secu
  • M2X33C33 : Smart Home Assistant Owners –> #Smt_Hm_Ass
  • M2X33C34 : Smart Home Devices Owners –> Smt_Hm_Dev
  • M2Y33C31 : Smart Thermostat Owners- Planning –> #Smat_Therm-Pln
  • M2Y32C32 : Smart Home Security Owners-Planning –> #Smt_Hm_Sec-Pln
  • M2Y33C33 : Smart Home Assistant Owners-Planning –> #Smt_Hm_Asst-Pln
  • M2Y33C34 : Smart Home Devices Owners-Planning –> #Smt_Hm_Dev-Pln
  • M3Q241C45 : Financial App Trust –> Fin_App_Trst

The variables were also reordered so that technology-related variables are grouped together and demographic-related variables are grouped together. The column names printed below confirm that all variables have been renamed correctly.

new_names<-c(SV00015="Confidence", SV00028= "Tech Enthu.",ECYPTAMED= "Pop_Age",ECYPMAMED= "M_Pop_Age",ECYPFAMED= "F_Pop_Age",ECYHSZAVG= "#_Pvt_HHld",ECYHTYFHT= "#_Fam_Hhld",ECYHTYNFH= "#_NF_Hhld",ECYCHAFHCH= "Avg_Chld_PrCen_Fm_Hhld",ECYSTYHOUS= "Houses",ECYSTYAPT= "Apt",ECYCDOCO= "Condos",ECYHNIMED="Hhld Income",ECYEDUHSCE="High School",ECYEDUCOLL="Clg_CEGEP_Non_Uni_Dip",ECYEDUUD="Uni_Dip_Blw_Bach",ECYACTEMP="Empl",ECYACTUEMP="UnEmp",ECYTIMIMGT="Tot_Mig",CFM0876I="Ann_Prem_Hm_Insu",M221C45="Mob_Mark_cons",M222C45="Mob_Inf_sec_cons",M3F3C3="# Smart_purch",M2X33C31="#Smt_Therm",M2X32C32="#Smt_Hm_Secu",M2X33C33="#Smt_Hm_Ass",M2X33C34="Smt_Hm_Dev",M2Y33C31="#Smat_Therm-Pln",M2Y32C32="#Smt_Hm_Sec-Pln",M2Y33C33="#Smt_Hm_Asst-Pln",M2Y33C34="#Smt_Hm_Dev-Pln",M3Q241C45="Fin_App_Trst",ECYBASPOP="Total_Pop")
df.hex4<-rename_columns(df.hex4,new_names)
colnames(df.hex4)
 [1] "CODE"                   "Confidence"             "Tech Enthu."           
 [4] "Total_Pop"              "Pop_Age"                "M_Pop_Age"             
 [7] "F_Pop_Age"              "#_Pvt_HHld"             "#_Fam_Hhld"            
[10] "#_NF_Hhld"              "Avg_Chld_PrCen_Fm_Hhld" "Houses"                
[13] "Apt"                    "Condos"                 "Hhld Income"           
[16] "High School"            "Clg_CEGEP_Non_Uni_Dip"  "Uni_Dip_Blw_Bach"      
[19] "Empl"                   "UnEmp"                  "Tot_Mig"               
[22] "Ann_Prem_Hm_Insu"       "Mob_Mark_cons"          "Mob_Inf_sec_cons"      
[25] "# Smart_purch"          "#Smt_Therm"             "#Smt_Hm_Secu"          
[28] "#Smt_Hm_Ass"            "Smt_Hm_Dev"             "#Smat_Therm-Pln"       
[31] "#Smt_Hm_Sec-Pln"        "#Smt_Hm_Asst-Pln"       "#Smt_Hm_Dev-Pln"       
[34] "Fin_App_Trst"          

Now we rearrange the column order:

col_order <- c("Confidence","Total_Pop","Tech Enthu.", "Ann_Prem_Hm_Insu", "Mob_Mark_cons","Mob_Inf_sec_cons","# Smart_purch","#Smt_Therm","#Smt_Hm_Secu","#Smt_Hm_Ass","Smt_Hm_Dev","#Smat_Therm-Pln","#Smt_Hm_Sec-Pln","#Smt_Hm_Asst-Pln","#Smt_Hm_Dev-Pln","Fin_App_Trst","CODE","Pop_Age","M_Pop_Age","F_Pop_Age","#_Pvt_HHld","#_Fam_Hhld","#_NF_Hhld","Avg_Chld_PrCen_Fm_Hhld","Houses","Apt","Condos","Hhld Income","High School","Clg_CEGEP_Non_Uni_Dip","Uni_Dip_Blw_Bach","Empl","UnEmp","Tot_Mig")
df.hex4 <- df.hex4[, col_order]
head(df.hex4)
  Confidence Total_Pop Tech Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1  2917.6303     11875   1658.5059        2636.2583       2102.76
2   555.0760      3339    475.3049        1025.1245        469.62
3   769.6208      3327    986.1495         746.9732        352.56
4   430.9621      2474    369.0277         611.4645        386.08
5   390.9832      1973    403.9488         528.4261        270.98
6  2109.9184     10258   2077.7702        2450.7502       1112.58
  Mob_Inf_sec_cons # Smart_purch #Smt_Therm #Smt_Hm_Secu #Smt_Hm_Ass Smt_Hm_Dev
1          1529.28       1529.28    1911.60       477.90     1816.02    1146.96
2           443.53        417.44     704.43       260.90      730.52     443.53
3           406.80        298.32     488.16       162.72      542.40     244.08
4           365.76        203.20     548.64       142.24      508.00     264.16
5           302.86        270.98     302.86        95.64      462.26     223.16
6          1748.34       1033.11    1907.28      1271.52     1827.81    1350.99
  #Smat_Therm-Pln #Smt_Hm_Sec-Pln #Smt_Hm_Asst-Pln #Smt_Hm_Dev-Pln Fin_App_Trst
1         1433.70          764.64           955.80          955.80      4970.16
2          234.81          286.99           208.72          260.90      1330.59
3          271.20          352.56           162.72          216.96      1328.88
4          243.84          223.52           203.20          162.56       995.68
5          175.34          207.22           111.58          127.52       828.88
6          874.17          715.23           476.82          715.23      3894.03
    CODE Pop_Age M_Pop_Age F_Pop_Age #_Pvt_HHld #_Fam_Hhld #_NF_Hhld
1 K0A1A0    46.1      44.9      47.7        2.6       3386      1064
2 K0A1B0    46.0      45.5      46.4        2.4       1143       220
3 K0A1E0    43.5      43.3      43.9        3.2        763       247
4 K0A1G0    43.9      43.1      45.1        3.0        626       187
5 K0A1K0    44.5      44.1      45.0        2.3        645       202
6 K0A1L0    40.9      39.9      41.8        2.8       3079       631
  Avg_Chld_PrCen_Fm_Hhld Houses Apt Condos Hhld Income High School
1                    0.9   3921 516    122    95077.29        2640
2                    1.0   1355   5      1   129215.03         753
3                    1.0    883 118     15    88144.35         860
4                    1.0    772  38      4   105583.46         582
5                    0.9    782  63     40    98552.99         490
6                    1.1   3491 216    126   146759.85        2055
  Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach Empl UnEmp Tot_Mig
1                  2926             1988 6045   174     851
2                   791              651 1817   115     305
3                   762              448 1839    98     161
4                   580              483 1315    59     129
5                   467              337 1036    64     270
6                  2156             2904 5805   248    1477
colnames(df.hex4)
 [1] "Confidence"             "Total_Pop"              "Tech Enthu."           
 [4] "Ann_Prem_Hm_Insu"       "Mob_Mark_cons"          "Mob_Inf_sec_cons"      
 [7] "# Smart_purch"          "#Smt_Therm"             "#Smt_Hm_Secu"          
[10] "#Smt_Hm_Ass"            "Smt_Hm_Dev"             "#Smat_Therm-Pln"       
[13] "#Smt_Hm_Sec-Pln"        "#Smt_Hm_Asst-Pln"       "#Smt_Hm_Dev-Pln"       
[16] "Fin_App_Trst"           "CODE"                   "Pop_Age"               
[19] "M_Pop_Age"              "F_Pop_Age"              "#_Pvt_HHld"            
[22] "#_Fam_Hhld"             "#_NF_Hhld"              "Avg_Chld_PrCen_Fm_Hhld"
[25] "Houses"                 "Apt"                    "Condos"                
[28] "Hhld Income"            "High School"            "Clg_CEGEP_Non_Uni_Dip" 
[31] "Uni_Dip_Blw_Bach"       "Empl"                   "UnEmp"                 
[34] "Tot_Mig"               

As some of the variables represent totals for households or population in a postal code, we divide those variables by the total population to express each value on a common “per person” basis.

df.hex4$`Tech Enthu.`=df.hex4$`Tech Enthu.`/df.hex4$Total_Pop
df.hex4$Ann_Prem_Hm_Insu=df.hex4$Ann_Prem_Hm_Insu/df.hex4$Total_Pop
df.hex4$Mob_Mark_cons=df.hex4$Mob_Mark_cons/df.hex4$Total_Pop
df.hex4$Mob_Inf_sec_cons=df.hex4$Mob_Inf_sec_cons/df.hex4$Total_Pop
df.hex4$Smt_Hm_Dev=df.hex4$Smt_Hm_Dev/df.hex4$Total_Pop
df.hex4$Fin_App_Trst=df.hex4$Fin_App_Trst/df.hex4$Total_Pop
df.hex4$Houses=df.hex4$Houses/df.hex4$Total_Pop
df.hex4$Apt=df.hex4$Apt/df.hex4$Total_Pop
df.hex4$Condos=df.hex4$Condos/df.hex4$Total_Pop
df.hex4$`Hhld Income`=df.hex4$`Hhld Income`/df.hex4$Total_Pop
df.hex4$`High School`=df.hex4$`High School`/df.hex4$Total_Pop
df.hex4$Clg_CEGEP_Non_Uni_Dip=df.hex4$Clg_CEGEP_Non_Uni_Dip/df.hex4$Total_Pop
df.hex4$Uni_Dip_Blw_Bach=df.hex4$Uni_Dip_Blw_Bach/df.hex4$Total_Pop
df.hex4$Empl=df.hex4$Empl/df.hex4$Total_Pop
df.hex4$UnEmp=df.hex4$UnEmp/df.hex4$Total_Pop
df.hex4$Tot_Mig=df.hex4$Tot_Mig/df.hex4$Total_Pop
df.hex4$`# Smart_purch`=df.hex4$`# Smart_purch`/df.hex4$Total_Pop
df.hex4$`#Smt_Therm`=df.hex4$`#Smt_Therm`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Secu`=df.hex4$`#Smt_Hm_Secu`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Ass`=df.hex4$`#Smt_Hm_Ass`/df.hex4$Total_Pop
df.hex4$`#Smat_Therm-Pln`=df.hex4$`#Smat_Therm-Pln`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Sec-Pln`=df.hex4$`#Smt_Hm_Sec-Pln`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Asst-Pln`=df.hex4$`#Smt_Hm_Asst-Pln`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Dev-Pln`=df.hex4$`#Smt_Hm_Dev-Pln`/df.hex4$Total_Pop
df.hex4$Confidence=df.hex4$Confidence/df.hex4$Total_Pop
head(df.hex4)
  Confidence Total_Pop Tech Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1  0.2456952     11875   0.1396637        0.2220007     0.1770745
2  0.1662402      3339   0.1423495        0.3070154     0.1406469
3  0.2313258      3327   0.2964080        0.2245186     0.1059693
4  0.1741965      2474   0.1491624        0.2471562     0.1560550
5  0.1981669      1973   0.2047384        0.2678287     0.1373441
6  0.2056852     10258   0.2025512        0.2389111     0.1084597
  Mob_Inf_sec_cons # Smart_purch #Smt_Therm #Smt_Hm_Secu #Smt_Hm_Ass Smt_Hm_Dev
1        0.1287815    0.12878147  0.1609768   0.04024421   0.1529280 0.09658611
2        0.1328332    0.12501947  0.2109704   0.07813717   0.2187841 0.13283318
3        0.1222723    0.08966637  0.1467268   0.04890893   0.1630298 0.07336339
4        0.1478416    0.08213420  0.2217623   0.05749394   0.2053355 0.10677445
5        0.1535023    0.13734415  0.1535023   0.04847440   0.2342930 0.11310694
6        0.1704367    0.10071261  0.1859310   0.12395399   0.1781839 0.13170111
  #Smat_Therm-Pln #Smt_Hm_Sec-Pln #Smt_Hm_Asst-Pln #Smt_Hm_Dev-Pln Fin_App_Trst
1      0.12073263      0.06439074       0.08048842      0.08048842    0.4185398
2      0.07032345      0.08595088       0.06250973      0.07813717    0.3984996
3      0.08151488      0.10596934       0.04890893      0.06521190    0.3994229
4      0.09856103      0.09034762       0.08213420      0.06570736    0.4024576
5      0.08886974      0.10502788       0.05655347      0.06463254    0.4201115
6      0.08521837      0.06972412       0.04648275      0.06972412    0.3796091
    CODE Pop_Age M_Pop_Age F_Pop_Age #_Pvt_HHld #_Fam_Hhld #_NF_Hhld
1 K0A1A0    46.1      44.9      47.7        2.6       3386      1064
2 K0A1B0    46.0      45.5      46.4        2.4       1143       220
3 K0A1E0    43.5      43.3      43.9        3.2        763       247
4 K0A1G0    43.9      43.1      45.1        3.0        626       187
5 K0A1K0    44.5      44.1      45.0        2.3        645       202
6 K0A1L0    40.9      39.9      41.8        2.8       3079       631
  Avg_Chld_PrCen_Fm_Hhld    Houses         Apt       Condos Hhld Income
1                    0.9 0.3301895 0.043452632 0.0102736842    8.006509
2                    1.0 0.4058101 0.001497454 0.0002994909   38.698721
3                    1.0 0.2654043 0.035467388 0.0045085663   26.493643
4                    1.0 0.3120453 0.015359741 0.0016168149   42.677227
5                    0.9 0.3963507 0.031931069 0.0202736949   49.950831
6                    1.1 0.3403198 0.021056736 0.0122830961   14.306868
  High School Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach      Empl      UnEmp
1   0.2223158             0.2464000        0.1674105 0.5090526 0.01465263
2   0.2255166             0.2368973        0.1949686 0.5441749 0.03444145
3   0.2584911             0.2290352        0.1346558 0.5527502 0.02945597
4   0.2352466             0.2344382        0.1952304 0.5315279 0.02384802
5   0.2483528             0.2366954        0.1708059 0.5250887 0.03243791
6   0.2003314             0.2101774        0.2830961 0.5658998 0.02417625
     Tot_Mig
1 0.07166316
2 0.09134471
3 0.04839194
4 0.05214228
5 0.13684744
6 0.14398518
h2o.describe(as.h2o(df.hex4 ))

                    Label   Type Missing  Zeros PosInf NegInf Min
1              Confidence   real   30400   2229      0      0   0
2               Total_Pop    int       0  30400      0      0   0
3             Tech Enthu.   real   30400   2229      0      0   0
4        Ann_Prem_Hm_Insu   real   30400   2229      0      0   0
5           Mob_Mark_cons   real   30400   2248      0      0   0
6        Mob_Inf_sec_cons   real   30400   2248      0      0   0
7           # Smart_purch   real   30400   2248      0      0   0
8              #Smt_Therm   real   30400   2248      0      0   0
9            #Smt_Hm_Secu   real   30400   2353      0      0   0
10            #Smt_Hm_Ass   real   30400   2248      0      0   0
11             Smt_Hm_Dev   real   30400   2248      0      0   0
12        #Smat_Therm-Pln   real   30400   2251      0      0   0
13        #Smt_Hm_Sec-Pln   real   30400   2270      0      0   0
14       #Smt_Hm_Asst-Pln   real   30400   2271      0      0   0
15        #Smt_Hm_Dev-Pln   real   30400   2251      0      0   0
16           Fin_App_Trst   real   30400   2248      0      0   0
17                   CODE string       0      0      0      0 NaN
18                Pop_Age   real       0  30400      0      0   0
19              M_Pop_Age   real       0  33420      0      0   0
20              F_Pop_Age   real       0  49473      0      0   0
21             #_Pvt_HHld   real       0  32629      0      0   0
22             #_Fam_Hhld    int       0  33957      0      0   0
23              #_NF_Hhld    int       0  61796      0      0   0
24 Avg_Chld_PrCen_Fm_Hhld   real       0  37183      0      0   0
25                 Houses   real   30400   7985      0      0   0
26                    Apt   real   30400 107710      0      0   0
27                 Condos   real   30400 186082      0      0   0
28            Hhld Income   real   30400   2229      0      0   0
29            High School   real   30400   6110      0      0   0
30  Clg_CEGEP_Non_Uni_Dip   real   30400  10660      0      0   0
31       Uni_Dip_Blw_Bach   real   30400  20955      0      0   0
32                   Empl   real   30400   2404      0      0   0
33                  UnEmp   real   30400 102846      0      0   0
34                Tot_Mig   real   30400  23913      0      0   0
               Max          Mean          Sigma Cardinality
1        0.4615281    0.22364379     0.05600671          NA
2    17176.0000000   51.39055065   184.71751083          NA
3        0.4084117    0.21796779     0.05953486          NA
4        0.8051892    0.21656763     0.07575771          NA
5        0.3200000    0.13348765     0.03520981          NA
6        0.4500000    0.18203466     0.04390049          NA
7        0.4947183    0.14579603     0.05531290          NA
8        0.5075000    0.18677906     0.06775226          NA
9        0.3200000    0.08309306     0.04106577          NA
10       0.4800000    0.20590744     0.05357666          NA
11       0.3794595    0.11621416     0.04169025          NA
12       0.2520000    0.09426797     0.03153806          NA
13       0.3208333    0.11155460     0.04268567          NA
14       0.2818182    0.07922398     0.03198151          NA
15       0.3051163    0.09788840     0.03487279          NA
16       0.7200000    0.43228969     0.07519694          NA
17             NaN            NA             NA          NA
18      90.2000000   38.46463833    16.20522133          NA
19      90.2000000   37.45517321    16.81544034          NA
20      90.2000000   36.64113724    20.61625165          NA
21      15.0000000    2.71663221     1.54614338          NA
22    5659.0000000   13.59636477    53.28873497          NA
23    1450.0000000    5.87630618    22.65970172          NA
24      11.3000000    1.08344383     0.68519524          NA
25       1.0000000    0.27825254     0.12110564          NA
26       1.0000000    0.08330559     0.14144898          NA
27       1.0000000    0.02688702     0.07406307          NA
28 1503523.0100000 6903.10433236 15586.33001756          NA
29       1.0000000    0.23838485     0.08775688          NA
30       1.0000000    0.18352363     0.07958586          NA
31       1.0000000    0.23445186     0.15831378          NA
32       1.0000000    0.51589893     0.15144526          NA
33       1.0000000    0.03369596     0.04612460          NA
34       1.0000000    0.24478782     0.18853522          NA

Writing the combined data frame to a file so that the work above does not need to be repeated.

Note: The final CSV generated here contains a few rows with 0 values in all columns, so we deleted those rows manually and included the final “Project_Merged_File” in the zip file itself.

#h2o.exportFile(as.h2o(df.hex4), "C:/Jaspreet Marketing Analytics Project/Project_Merged_File.csv", force=TRUE)

6.5 Segmentation Analysis

6.5.1 Reading in the Data

We begin by reading the variables back in from the CSV file.

fs1 <- read.csv("C:/Jaspreet Marketing Analytics Project/Project_Merged_File.csv")

6.5.2 Missing Values : XRAY

Missing values are one of the most difficult realities of working with data. Using the xray package, we confirmed there were no missing values to deal with.

xray::anomalies(fs1)
$variables
                 Variable      q qNA pNA  qZero  pZero qBlank pBlank qInf pInf
1                  Condos 252726   0   - 183834 72.74%      0      -    0    -
2                     Apt 252726   0   - 105463 41.73%      0      -    0    -
3                   UnEmp 252726   0   - 100598 39.81%      0      -    0    -
4              X._NF_Hhld 252726   0   -  29148 11.53%      0      -    0    -
5                 Tot_Mig 252726   0   -  21665  8.57%      0      -    0    -
6               F_Pop_Age 252726   0   -  19015  7.52%      0      -    0    -
7        Uni_Dip_Blw_Bach 252726   0   -  18712   7.4%      0      -    0    -
8   Clg_CEGEP_Non_Uni_Dip 252726   0   -   8413  3.33%      0      -    0    -
9                  Houses 252726   0   -   5755  2.28%      0      -    0    -
10 Avg_Chld_PrCen_Fm_Hhld 252726   0   -   4553   1.8%      0      -    0    -
11            High.School 252726   0   -   3863  1.53%      0      -    0    -
12              M_Pop_Age 252726   0   -   2927  1.16%      0      -    0    -
13            X._Fam_Hhld 252726   0   -   1328  0.53%      0      -    0    -
14                   Empl 252726   0   -    172  0.07%      0      -    0    -
15          X.Smt_Hm_Secu 252726   0   -    105  0.04%      0      -    0    -
16      X.Smt_Hm_Asst.Pln 252726   0   -     23  0.01%      0      -    0    -
17       X.Smt_Hm_Sec.Pln 252726   0   -     22  0.01%      0      -    0    -
18       X.Smat_Therm.Pln 252726   0   -      3     0%      0      -    0    -
19       X.Smt_Hm_Dev.Pln 252726   0   -      3     0%      0      -    0    -
20            X._Pvt_HHld 252726   0   -      0      -      0      -    0    -
21                Pop_Age 252726   0   -      0      -      0      -    0    -
22              Total_Pop 252726   0   -      0      -      0      -    0    -
23             Confidence 252726   0   -      0      -      0      -    0    -
24            Tech.Enthu. 252726   0   -      0      -      0      -    0    -
25       Ann_Prem_Hm_Insu 252726   0   -      0      -      0      -    0    -
26          Mob_Mark_cons 252726   0   -      0      -      0      -    0    -
27       Mob_Inf_sec_cons 252726   0   -      0      -      0      -    0    -
28             Smt_Hm_Dev 252726   0   -      0      -      0      -    0    -
29           Fin_App_Trst 252726   0   -      0      -      0      -    0    -
30           X.Smt_Hm_Ass 252726   0   -      0      -      0      -    0    -
31         X..Smart_purch 252726   0   -      0      -      0      -    0    -
32            X.Smt_Therm 252726   0   -      0      -      0      -    0    -
33            Hhld.Income 252726   0   -      0      -      0      -    0    -
34                   CODE 252726   0   -      0      -      0      -    0    -
   qDistinct      type anomalous_percent
1       7567   Numeric            72.74%
2      10390   Numeric            41.73%
3       5136   Numeric            39.81%
4        479   Integer            11.53%
5      11811   Numeric             8.57%
6        723   Numeric             7.52%
7      10293   Numeric              7.4%
8       7046   Numeric             3.33%
9       8719   Numeric             2.28%
10        48   Numeric              1.8%
11      7634   Numeric             1.53%
12       668   Numeric             1.16%
13       804   Integer             0.53%
14     10525   Numeric             0.07%
15     48765   Numeric             0.04%
16     44903   Numeric             0.01%
17     48552   Numeric             0.01%
18     44468   Numeric                0%
19     45573   Numeric                0%
20       104   Numeric                 -
21       670   Numeric                 -
22      1608   Integer                 -
23     39750   Numeric                 -
24     39750   Numeric                 -
25     43899   Numeric                 -
26     44181   Numeric                 -
27     45403   Numeric                 -
28     48251   Numeric                 -
29     48628   Numeric                 -
30     49279   Numeric                 -
31     49827   Numeric                 -
32     54767   Numeric                 -
33    140106   Numeric                 -
34    252726 Character                 -

$problem_variables
 [1] Variable          q                 qNA               pNA              
 [5] qZero             pZero             qBlank            pBlank           
 [9] qInf              pInf              qDistinct         type             
[13] anomalous_percent problems         
<0 rows> (or 0-length row.names)

6.6 Level 1 Segmentation

For the segmentation analyses, K-means has been used.

Level 1 segmentation used the attitude variables, which were predominantly technology-focused. There was also one variable about insurance premiums and another about confidence in big businesses.

#library(h2o)
#h2o.init()

The dataset, now imported into the h2o cluster, still has all correctly recoded variables.

fs.class<-fs1[-c(2)]
fs.class <- as.h2o(fs.class)

str(fs.class)
Class 'H2OFrame' <environment: 0x000000002910d028> 
 - attr(*, "op")= chr "Parse"
 - attr(*, "id")= chr "fs.class_sid_abe7_21"
 - attr(*, "eval")= logi FALSE
 - attr(*, "nrow")= int 252726
 - attr(*, "ncol")= int 33
 - attr(*, "types")=List of 33
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "string"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "int"
  ..$ : chr "int"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "real"
 - attr(*, "data")='data.frame':    10 obs. of  33 variables:
  ..$ Confidence            : num  0.00943 0.02595 0.02524 0.03188 0.03043 ...
  ..$ Tech.Enthu.           : num  0.00974 0.01475 0.02987 0.02482 0.02848 ...
  ..$ Ann_Prem_Hm_Insu      : num  0.0044 0.00587 0.02074 0.0181 0.00905 ...
  ..$ Mob_Mark_cons         : num  0.00565 0.01267 0.01556 0.01176 0.01877 ...
  ..$ Mob_Inf_sec_cons      : num  0.00753 0.02139 0.02556 0.02706 0.02765 ...
  ..$ X..Smart_purch        : num  0.00715 0.01426 0.01889 0.02118 0.02469 ...
  ..$ X.Smt_Therm           : num  0.00979 0.02218 0.02889 0.02824 0.0237 ...
  ..$ X.Smt_Hm_Secu         : num  0.00452 0.00713 0.01222 0.00941 0.01185 ...
  ..$ X.Smt_Hm_Ass          : num  0.012 0.0253 0.0322 0.0353 0.0267 ...
  ..$ Smt_Hm_Dev            : num  0.00715 0.01188 0.01889 0.02 0.01877 ...
  ..$ X.Smat_Therm.Pln      : num  0.00452 0.01109 0.01111 0.01647 0.01778 ...
  ..$ X.Smt_Hm_Sec.Pln      : num  0.00791 0.01267 0.01889 0.01412 0.02173 ...
  ..$ X.Smt_Hm_Asst.Pln     : num  0.00264 0.00554 0.01111 0.01529 0.01284 ...
  ..$ X.Smt_Hm_Dev.Pln      : num  0.00414 0.01109 0.00889 0.01294 0.0158 ...
  ..$ Fin_App_Trst          : num  0.0188 0.0451 0.0567 0.0659 0.0583 ...
  ..$ CODE                  : chr  "K7L5R2" "K9V6H4" "N2L2G4" "N2L2J2" ...
  ..$ Pop_Age               : num  41.8 32.9 84.3 85 83.8 76.8 81.2 71.8 37.5 77
  ..$ M_Pop_Age             : num  42.1 32.3 83.6 86.1 76.9 74 82.6 70.9 35 42.2
  ..$ F_Pop_Age             : num  15 37.5 85 83.5 86 78.1 80 72.7 67.5 80.5
  ..$ X._Pvt_HHld           : num  6 11 7 5 14 3.6 2 7 2 9
  ..$ X._Fam_Hhld           : num  2 1 1 1 1 5 1 1 1 1
  ..$ X._NF_Hhld            : num  1 0 0 0 0 9 0 0 0 0
  ..$ Avg_Chld_PrCen_Fm_Hhld: num  3 3 2 1 3 2.4 1 1 1 2
  ..$ Houses                : num  0.00235 0.0099 0.02778 0.02941 0.01235 ...
  ..$ Apt                   : num  0.00235 0 0 0 0 ...
  ..$ Condos                : num  0.00235 0 0 0 0 ...
  ..$ Hhld.Income           : num  128 906 2407 1528 1765 ...
  ..$ High.School           : num  0.00706 0.0396 0 0.02941 0.04938 ...
  ..$ Clg_CEGEP_Non_Uni_Dip : num  0.00941 0.0198 0 0.02941 0.01235 ...
  ..$ Uni_Dip_Blw_Bach      : num  0.0141 0 0.0833 0.0294 0.0123 ...
  ..$ Empl                  : num  0.0118 0.0495 0.0556 0.0882 0.0864 ...
  ..$ UnEmp                 : num  0.00471 0 0.02778 0.02941 0 ...
  ..$ Tot_Mig               : num  0.0235 0.0099 0.0278 0.0294 0.0741 ...

6.6.1 A 2 Segment Solution

Level 1 segmentation begins with a two-cluster solution, then moves on to 3 segments and so on until useful options are identified. Two segments is just a first guess.

The dfK variable below holds the sums of squares for each segmentation solution. One key objective of segmentation is to maximize the differences among segments while keeping individuals within each segment close together. That translates into maximizing the between-segments sum-of-squares and minimizing the within-segments sum-of-squares.

dfK <- data.frame(Numbr_Segments=numeric(),
                  TotWithinSS=numeric(),
                  BetweenSS=numeric(),
                  TotSS=numeric(),
                  stringsAsFactors=FALSE) 
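
The three quantities tracked in dfK are tied together by an identity: for any partition, the total sum-of-squares equals the within-segments plus the between-segments sums-of-squares, so driving the within term down necessarily drives the between term up. A minimal sketch on hypothetical 1-D data (not the project data) illustrates this:

```r
# Illustrating the identity TotSS = TotWithinSS + BetweenSS
# on hypothetical 1-D data with an obvious k = 2 partition.
points   <- c(1, 2, 3, 10, 11, 12)
clusters <- list(c(1, 2, 3), c(10, 11, 12))

g <- mean(points)                                  # grand mean
tot_ss     <- sum((points - g)^2)                  # distances to grand mean
within_ss  <- sum(sapply(clusters, function(cl)
                  sum((cl - mean(cl))^2)))         # distances to own centroid
between_ss <- sum(sapply(clusters, function(cl)
                  length(cl) * (mean(cl) - g)^2))  # centroids to grand mean
all.equal(tot_ss, within_ss + between_ss)          # TRUE
```

Because the total is fixed by the data, ranking solutions by a smaller TotWithinSS or by a larger BetweenSS is the same comparison seen from two directions.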

This code-chunk conducts the k-means analysis in h2o on all attitude variables. Using these variables, we wanted to see whether they could partition the dataset into clusters that are distinct from one another.

k_iterations = 2  
fs.km2 <- h2o.kmeans(fs.class[, 1:15], k = k_iterations, estimate_k = TRUE, 
                   init=c("PlusPlus"), standardize= FALSE,
                   score_each_iteration= FALSE, seed = 7238,
                   keep_cross_validation_predictions=TRUE)

The following code-chunk saves key information from this solution in the dfK receptacle so that a comparative table can be produced when all rounds of partitioning are finished. The dfK object holds the number of segments, the ‘within sum of squares’ (which we aim to minimize), the ‘between sum of squares’ (which we aim to maximize), and the ‘total sum of squares’ (which is always the same).

dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km2) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km2)   # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km2)       # total sums of squares

Out of this two-segment k-means (i.e., the segmentation clusters), there are 96,002 observations in one cluster and 156,724 in the second. They are uneven, but it’s nothing to be concerned about; with additional segments they will hopefully even out.

h2o.cluster_sizes(fs.km2) # sizes of the segmentation clusters 
[1]  96002 156724

The following presents the cluster means for each variable, with centroid numbering. For example, the mean of “confidence” is 0.252506 for segment 1 and 0.209111 for segment 2.

fs.km2@model$centers   
Cluster Means: 
  centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1        1   0.252506  0.251570         0.189596      0.147918         0.196269
2        2   0.209111  0.200449         0.236151      0.126548         0.175912
  xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1     0.184659   0.243259     0.117882    0.247712   0.150620       0.110301
2     0.124040   0.154800     0.062937    0.183209   0.096768       0.085782
  xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1       0.145282        0.098794       0.120587     0.448208
2       0.092458        0.068352       0.085364     0.428726

Using ggplot, a plot of the centroids for the 2 segments over all basis variables was created. You’ll notice that there is no strict ‘layering’ between the two segments in the centroid-means plot; there is some crossover, which helps with describing persona profiles. We know that further segmentation will yield better results, so we stopped investigating the 2-segment solution and moved on to 3-segment clustering below this plot.

library(reshape2) # it's necessary to reshape the data into 'long' format
fs2c_long<- melt(fs.km2@model$centers ) # need to reshape to 'long' form
library(ggplot2)
ggplot(data=fs2c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (0-0.45)") + 
  ylim(0, 0.45) +
  annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)  

6.6.2 A 3 segment solution

This code-chunk just changes the k-means command to k = 3, i.e., 3 segments; fs.km3 is the 3-segment solution.

k_iterations = 3 
fs.km3 <- h2o.kmeans( fs.class[,1:15], k=k_iterations, estimate_k = TRUE, 
                   init=c("PlusPlus"), standardize= FALSE,
                   score_each_iteration= FALSE, seed = 7238,
                   keep_cross_validation_predictions=TRUE )

The sizes of the three segmentation clusters are below. The sizes are now a lot more even: 81,147, 81,866, and 89,713.

dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km3) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km3)   # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km3)       # total sums of squares

The sizes of the segmentation clusters and the cluster means, with centroid numbering, are below.

h2o.cluster_sizes(fs.km3) # sizes of the segmentation clusters
[1] 81147 81866 89713
fs.km3@model$centers
Cluster Means: 
  centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1        1   0.256710  0.253889         0.179860      0.148702         0.195111
2        2   0.199922  0.196589         0.189364      0.122409         0.167667
3        3   0.220600  0.210086         0.279686      0.133024         0.187692
  xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1     0.188097   0.250251     0.123119    0.250546   0.153713       0.110484
2     0.122583   0.143535     0.060426    0.170710   0.089837       0.079019
3     0.132027   0.172913     0.069328    0.202337   0.108916       0.095679
  xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1       0.149757        0.101535       0.123214     0.441721
2       0.086693        0.065728       0.078356     0.396296
3       0.102135        0.073150       0.094995     0.467023

The following plot of the centroids for the 3 segments is slightly more interesting than the 2-segment solution, but we still need to push further. Some initial observations:

  • Segment 1 comprises the technophiles. They own a lot of smart technology and plan on purchasing more. This segment is highly enthusiastic about technology and trusting of big businesses.
  • Segment 2 is the least technologically-enabled of the three segments. They do, however, pay higher premiums for home insurance than Segment 1, the technophiles.
  • Segment 3 is interesting because they pay the highest home insurance premiums by a long shot but don’t own much smart home technology. The level 2 segmentation will be telling here: these individuals may be older, with more money and thus larger, more expensive houses carrying higher premiums, yet, because they’re older, somewhat less technologically enabled.
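
To check hunches like these, the cluster assignments can be joined back onto the rows and profiled against the demographic variables. A sketch of one way to do this in h2o (assuming the h2o session and the fs.class frame from above are still live):

```r
# Attach each row's segment label for later profiling (sketch).
seg <- h2o.predict(fs.km3, fs.class[, 1:15])  # one cluster id per row
fs.labeled <- h2o.cbind(fs.class, seg)        # append the 'predict' column
h2o.table(fs.labeled$predict)                 # row counts per segment
```

The labeled frame can then be grouped by the predict column to compare demographic means across segments.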

Below the plot, we have decided to push further with a 4-Segment Solution.

fs3c_long<- melt(fs.km3@model$centers ) # need to reshape to 'long' form
 
ggplot(data=fs3c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (0-0.45)") + ylim(0,0.45) +
  annotate("text", x = 3, y = 1, label = "Base = 1036", size=4) 

6.6.3 A 4 segment solution

As usual, the clustering algorithm groups the respondents.

#____k-means clustering, k=4 clusters
k_iterations <- 4 # MAXIMUM NUMBER OF SEGMENTS USED HERE
fs.km4 <- h2o.kmeans(fs.class[,1:15], k=k_iterations, estimate_k = FALSE, 
                    init=c("PlusPlus"), standardize= FALSE,
                    score_each_iteration= FALSE, seed = 7238,
                    keep_cross_validation_predictions=TRUE) 

Saving the sums-of-squares to the dfK data frame.

dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km4) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km4)   # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km4)       # total sums of squares 

The sizes of the four clusters are getting a little uneven again, but it’s nothing to be concerned about: 41,569, 76,774, 62,595, and 71,788.

h2o.cluster_sizes(fs.km4) # 
[1] 41569 76774 62595 71788

The cluster means, with centroid numbering, are provided below.

fs.km4@model$centers  
Cluster Means: 
  centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1        1   0.201426  0.258592         0.161547      0.144135         0.208495
2        2   0.204561  0.181748         0.218850      0.117299         0.158659
3        3   0.267740  0.249957         0.179796      0.149719         0.190807
4        4   0.225678  0.212453         0.286128      0.134919         0.190283
  xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1     0.190667   0.147765     0.075679    0.193768   0.105557       0.093433
2     0.107581   0.150261     0.058678    0.171162   0.089400       0.078181
3     0.184109   0.265848     0.131476    0.258052   0.160900       0.111391
4     0.132095   0.185768     0.074036    0.211663   0.116090       0.100329
  xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1       0.098799        0.080699       0.101414     0.462595
2       0.086462        0.062731       0.075302     0.392077
3       0.159152        0.105840       0.127678     0.431884
4       0.108018        0.075420       0.097391     0.473106

This plot of the centroids for the 4 segments shows much greater complexity than did the earlier solutions. Some observations:

  • Segment 1 is the second most important group for RBC to target, based on its enthusiasm for technology, smart home technology adoption, and plans to purchase more smart home technology. They are not extreme technophiles like Segment 3, but they are the second most technologically-enabled segment.
  • Segment 2 is the lowest-priority segment to market to, given RBCI’s goal of smart technology partnership and marketing. They have the lowest adoption of smart home technology and the weakest plans to purchase it in the future. This segment could in fact encompass an elderly population, but that would need to be confirmed in the level 2 segmentation.
  • Segment 3 is now the technophile group, with the most confidence in big businesses and the highest smart home technology ownership and purchase plans.
  • Segment 4 is now the group that pays high insurance premiums but is slow to adopt smart home technology.

Before comparing the sums-of-squares, we also looked into a 5-segment solution.

library(reshape2) # it's necessary to reshape the data into 'long' format
fs4c_long<- melt(fs.km4@model$centers ) # need to reshape to 'long' form

library(ggplot2)
ggplot(data=fs4c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (0-0.5)") + ylim(0, 0.5) +
  annotate("text", x = 3, y = 1, label = "Base = 1036", size=4) 

6.6.4 A 5 segment solution

As usual, the clustering algorithm groups the respondents.

k_iterations = 5
fs.km5<- h2o.kmeans(fs.class[,1:15], k=k_iterations, estimate_k = FALSE, 
                   init=c("PlusPlus"), standardize= FALSE,
                   score_each_iteration= FALSE, seed = 7238,
                   keep_cross_validation_predictions=TRUE) 

Saving the sums-of-squares to the dfK data frame.

dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km5) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km5)   # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km5)       # total sums of squares

The sizes of the five clusters are getting a little uneven again, but it’s nothing to be concerned about: 28,435, 55,811, 37,787, 68,582, and 62,111.

h2o.cluster_sizes(fs.km5) # 
[1] 28435 55811 37787 68582 62111

The cluster means, with centroid numbering, are below.

fs.km5@model$centers   
Cluster Means: 
  centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1        1   0.196001  0.256632         0.159647      0.149405         0.217220
2        2   0.226635  0.207035         0.302774      0.133919         0.189772
3        3   0.286595  0.254967         0.169591      0.157873         0.193669
4        4   0.202890  0.180364         0.214557      0.116860         0.157771
5        5   0.227482  0.237932         0.202917      0.134659         0.185620
  xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1     0.200842   0.125183     0.067929    0.185842   0.099148       0.089929
2     0.123966   0.171386     0.066838    0.202734   0.106343       0.097778
3     0.196887   0.279565     0.146351    0.271448   0.169732       0.117141
4     0.106023   0.146765     0.057151    0.168893   0.087195       0.077218
5     0.159467   0.225617     0.099206    0.227997   0.137927       0.102015
  xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1       0.091335        0.080756       0.103926     0.476561
2       0.103691        0.074386       0.097439     0.476269
3       0.174145        0.117362       0.141742     0.432556
4       0.084650        0.061744       0.074103     0.389897
5       0.124923        0.082604       0.099543     0.435099

This plot of the centroids for the 5 segments shows much greater complexity than did the earlier solutions. Perhaps too much complexity. The segments are getting harder to differentiate because there is so much overlap.

library(reshape2) # it's necessary to reshape the data into 'long' format
fs5c_long<- melt(fs.km5@model$centers ) # need to reshape to 'long' form

library(ggplot2)
ggplot(data=fs5c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (0-0.6)") + 
  ylim(0,0.6) +
  annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)  

Now let’s compare the sums-of-squares information to help determine the number of segments to investigate further. The between-segments sum-of-squares necessarily grows as segments are added, and the 5-segment solution has the highest; the real question is where the marginal gain begins to level off.

pacman::p_load(sjPlot)
tab_df(dfK, sort.column = -3, show.rownames = FALSE, 
       title = "Sums of Squares for Level 1 Segmentation", digits=2)
Sums of Squares for Level 1 Segmentation
  Numbr_Segments  TotWithinSS  BetweenSS    TotSS
               5      5433.62    3605.08  9038.71
               4      5756.99    3281.72  9038.71
               3      6258.32    2780.38  9038.71
               2      6950.98    2087.72  9038.71
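
Read as a proportion of total variance explained (BetweenSS / TotSS), the table shows diminishing marginal returns as segments are added. A quick sketch using the values from the table above:

```r
# Proportion of the total sum-of-squares explained between segments,
# computed from the values in the comparison table above.
ss <- data.frame(k         = 2:5,
                 BetweenSS = c(2087.72, 2780.38, 3281.72, 3605.08),
                 TotSS     = 9038.71)
ss$PropExplained <- ss$BetweenSS / ss$TotSS
round(ss$PropExplained, 3)        # 0.231 0.308 0.363 0.399
round(diff(ss$PropExplained), 3)  # marginal gains: 0.077 0.055 0.036
```

Each added segment buys less separation than the last, which is one reason to stop partitioning once the gain flattens and the segments stop being interpretable.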

Having compared the sums-of-squares, we also looked into a 6-segment solution.

6.6.5 A 6 segment solution

As usual, the clustering algorithm groups the respondents.

k_iterations <- 6
fs.km6<- h2o.kmeans(fs.class[,1:15], k=k_iterations, estimate_k = FALSE, 
                   init=c("PlusPlus"), standardize= FALSE,
                   score_each_iteration= FALSE, seed = 7238,
                   keep_cross_validation_predictions=TRUE) 

Saving the sums-of-squares to the dfK data frame.

dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km6) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km6)   # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km6)       # total sums of squares

The sizes of the six clusters are getting a little uneven again, but it’s nothing to be concerned about: 26,635, 58,409, 34,745, 61,340, 42,733, and 28,864.

h2o.cluster_sizes(fs.km6) 
[1] 26635 58409 34745 61340 42733 28864

The cluster means, with centroid numbering, are provided below.

fs.km6@model$centers 
Cluster Means: 
  centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1        1   0.191754  0.256654         0.160416      0.149149         0.217835
2        2   0.213186  0.208622         0.246539      0.132280         0.180940
3        3   0.281919  0.250519         0.176070      0.159439         0.196190
4        4   0.202426  0.178207         0.214629      0.115948         0.156440
5        5   0.245322  0.254933         0.179780      0.134451         0.183664
6        6   0.236315  0.207693         0.338468      0.137530         0.201611
  xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1     0.204414   0.124788     0.067997    0.186497   0.099902       0.088469
2     0.131895   0.195959     0.078413    0.214738   0.121044       0.098349
3     0.199055   0.287712     0.147283    0.277895   0.174777       0.116015
4     0.103886   0.141856     0.055007    0.165026   0.084133       0.075502
5     0.164899   0.224125     0.105452    0.223130   0.135033       0.104734
6     0.127743   0.160589     0.062986    0.198947   0.101528       0.097754
  xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1       0.089631        0.079154       0.103678     0.479893
2       0.105739        0.071653       0.090768     0.454770
3       0.173174        0.114121       0.141037     0.444190
4       0.082964        0.061023       0.073132     0.387059
5       0.135375        0.093152       0.107180     0.413531
6       0.104794        0.077433       0.102914     0.490498

This plot of the centroids for the 6 segments shows much greater complexity than did the earlier solutions. Perhaps too much complexity. The segments are getting harder to differentiate because there is so much overlap.

fs6c_long<- melt(fs.km6@model$centers ) # need to reshape to 'long' form

ggplot(data=fs6c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (0-0.55)") + ylim(0, 0.55) +
  annotate("text", x = 3, y = 0.53, label = "Base = 1036", size=4)  # keep label inside ylim(0, 0.55)

Having fit the 2- through 6-segment solutions, we have plotted their sums-of-squares information to help determine the number of segments to investigate further. With each added segment the WithinSS decreases and the BetweenSS increases, which is what we want. The 6-segment solution has the lowest WithinSS and the highest BetweenSS; however, the rate of improvement in both measures declined sharply at the 6-segment solution. The 6-segment centroid plot was also difficult to turn into differentiated personas: the lines are not well separated, so the segments are not that different. The 4-segment solution seemed best from an interpretation standpoint, and it was therefore further investigated.

tab_df(dfK, sort.column = -3, show.rownames = FALSE, title = "Sums of Squares for Level 1 Segmentation", digits=2)
Sums of Squares for Level 1 Segmentation
Numbr_Segments TotWithinSS BetweenSS TotSS
6 5274.12 3764.59 9038.71
5 5433.62 3605.08 9038.71
4 5756.99 3281.72 9038.71
3 6258.32 2780.38 9038.71
2 6950.98 2087.72 9038.71
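The diminishing returns become obvious when the BetweenSS column is differenced; the values below are transcribed from the table above (a base-R sketch, not part of the pipeline):

```r
# Incremental gain in BetweenSS from adding one more segment
# (values transcribed from the table above).
between_ss <- c(k2 = 2087.72, k3 = 2780.38, k4 = 3281.72,
                k5 = 3605.08, k6 = 3764.59)
round(diff(between_ss), 2)
#     k3     k4     k5     k6
# 692.66 501.34 323.36 159.51
```

Each added segment buys less additional separation: the gain drops from about 693 (moving from 2 to 3 segments) to about 160 (moving from 5 to 6).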

The graph below may provide a very rough guide to the number of segments, but it is not definitive. The best number of segments may be where there is an elbow in the curve. An elbow seems to appear at the ______ segment solution, with diminishing returns after it, although the 5-segment solution still increases the distances between segments and reduces the distances within segments. We know from the ggplot above that the 3-segment solution is not sufficient, which is why we stress that the ‘elbow’ guideline is simply a guideline.

library(plotly)
dfK_melt <- melt(dfK, id=c("Numbr_Segments"))

p <- ggplot(data = dfK_melt, aes(x = Numbr_Segments, y = value)) +
  geom_point(aes(), size = 4) +
  geom_line(aes(colour = variable ), size=2) + facet_wrap(~ variable)+
  labs(x="Numbers of segments", y="Sums of squares") +
  theme(axis.title.x=element_text(size=16, face="bold", colour="blue")) +
  theme(axis.title.y=element_text(size=16, face="bold", colour="blue")) +
  theme(axis.text.x=element_text(size=14, face="bold" )) +
  theme(axis.text.y=element_text(size=14, face="bold" ))  +
  ggtitle("Cluster statistics") + theme(legend.position="none") 
(gg <- ggplotly(p)) 

6.6.5 Building Predictive Models to Predict Segment Membership

We used Random Forest, GBM, and logistic regression models to predict the segment memberships obtained using K-means clustering.

6.6.5.1 Segment (cluster) assignments.

These segment assignments are obtained by predicting each respondent’s segment based on fs.km4 results. Below are the predicted segment assignments for the 4-segment solution.

clusters.hex <- h2o.predict(fs.km4, fs.class) 
clusters.hex <- as.factor(clusters.hex) # cluster 'names' must be factors for modeling
clusters.hex$predict # segment assignments for first 6 respondents of 1036

This table shows the segment sizes printed earlier in a slightly different manner.

h2o.table(clusters.hex$predict) # table of assignments over all respondents 
  predict Count
1       0 41569
2       1 76774
3       2 62595
4       3 71788

[4 rows x 2 columns] 

Levels need to be set to “Seg1”, “Seg2”, “Seg3”, and “Seg4” instead of 0, 1, 2, and 3.

h2o.setLevels(clusters.hex$predict, c("Seg1","Seg2","Seg3","Seg4"))
  predict
1    Seg2
2    Seg2
3    Seg2
4    Seg2
5    Seg2
6    Seg2

[252726 rows x 1 column] 
#__Bind RID, predicted segments, and fs variables
fs4C.class<- h2o.cbind(fs.class[,c(1)], clusters.hex$predict, fs.class[, c(2:33)])
fs4C.class <- fs4C.class[, c(1:34)] # reorganize the data frame

A table of assignments for all respondents follows.

h2o.table(clusters.hex$predict) 
  predict Count
1    Seg1 41569
2    Seg2 76774
3    Seg3 62595
4    Seg4 71788

[4 rows x 2 columns] 

These segment assignments and other file data have been saved in the following code-chunk.

h2o.exportFile(fs4C.class, 
    "C:/Jaspreet Marketing Analytics Project/FS_4.csv", force=TRUE) # save the results 

6.6.5.2 Splitting the sample for cross validation

fs.split<- h2o.splitFrame(fs4C.class, ratios=c(0.7) )
fs.split[[1]]  
   Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.009426962    Seg2 0.009739573      0.004403856   0.005647059
2 0.025949382    Seg2 0.014750739      0.005865521   0.012673267
3 0.025238024    Seg2 0.029873820      0.020744546   0.015555556
4 0.030433902    Seg2 0.028479670      0.009048419   0.018765432
5 0.056125319    Seg2 0.061331941      0.063795120   0.030182927
6 0.076100931    Seg2 0.039830300      0.092729259   0.022857143
  Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1      0.007529412    0.007152941 0.009788235   0.004517647   0.01204706
2      0.021386139    0.014257426 0.022178218   0.007128713   0.02534653
3      0.025555556    0.018888889 0.028888889   0.012222222   0.03222222
4      0.027654321    0.024691358 0.023703704   0.011851852   0.02666667
5      0.028170732    0.024146341 0.042256098   0.012073171   0.03823171
6      0.028571429    0.012857143 0.018571429   0.005714286   0.02000000
   Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.007152941      0.004517647      0.007905882       0.002635294
2 0.011881188      0.011089109      0.012673267       0.005544554
3 0.018888889      0.011111111      0.018888889       0.011111111
4 0.018765432      0.017777778      0.021728395       0.012839506
5 0.024146341      0.032195122      0.028170732       0.034207317
6 0.012857143      0.011428571      0.011428571       0.012857143
  X.Smt_Hm_Dev.Pln Fin_App_Trst   CODE Pop_Age M_Pop_Age F_Pop_Age X._Pvt_HHld
1      0.004141176   0.01882353 K7L5R2    41.8      42.1      15.0         6.0
2      0.011089109   0.04514852 K9V6H4    32.9      32.3      37.5        11.0
3      0.008888889   0.05666667 N2L2G4    84.3      83.6      85.0         7.0
4      0.015802469   0.05827160 M6B2P9    83.8      76.9      86.0        14.0
5      0.030182927   0.09054878 N7L5H5    76.8      74.0      78.1         3.6
6      0.010000000   0.07571429 K6H6X7    81.2      82.6      80.0         2.0
  X._Fam_Hhld X._NF_Hhld Avg_Chld_PrCen_Fm_Hhld      Houses         Apt
1           2          1                    3.0 0.002352941 0.002352941
2           1          0                    3.0 0.009900990 0.000000000
3           1          0                    2.0 0.027777778 0.000000000
4           1          0                    3.0 0.012345679 0.000000000
5           5          9                    2.4 0.085365854 0.000000000
6           1          0                    1.0 0.000000000 0.142857143
       Condos Hhld.Income High.School Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach
1 0.002352941    127.5026 0.007058824           0.009411765       0.01411765
2 0.000000000    905.8534 0.039603960           0.019801980       0.00000000
3 0.000000000   2407.2750 0.000000000           0.000000000       0.08333333
4 0.000000000   1764.7486 0.049382716           0.012345679       0.01234568
5 0.048780488    336.3186 0.091463415           0.000000000       0.07926829
6 0.000000000   9959.4643 0.142857143           0.000000000       0.00000000
        Empl       UnEmp    Tot_Mig
1 0.01176471 0.004705882 0.02352941
2 0.04950495 0.000000000 0.00990099
3 0.05555556 0.027777778 0.02777778
4 0.08641975 0.000000000 0.07407407
5 0.10365854 0.012195122 0.30487805
6 0.14285714 0.000000000 0.00000000

[176857 rows x 34 columns] 

The testing or holdout or validation sample also has a column indicating the assigned segment for each individual.

fs.split[[2]] 
  Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.03188337    Seg2  0.02481761       0.01809865    0.01176471
2 0.05467271    Seg2  0.05312974       0.06034417    0.03090909
3 0.08517833    Seg2  0.10082414       0.09335045    0.04000000
4 0.10647280    Seg2  0.07367943       0.02017770    0.03656250
5 0.04973771    Seg2  0.05238778       0.02002063    0.03150000
6 0.10034801    Seg2  0.10569465       0.07024781    0.04342105
  Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1       0.02705882     0.02117647  0.02823529   0.009411765   0.03529412
2       0.03272727     0.02181818  0.04545454   0.009090909   0.04181818
3       0.03750000     0.04250000  0.05500000   0.012500000   0.05000000
4       0.03937500     0.05062500  0.04781250   0.019687500   0.06187500
5       0.04050000     0.03150000  0.05100000   0.031500000   0.04350000
6       0.04342105     0.05500000  0.06368421   0.043421053   0.06368421
  Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.02000000       0.01647059       0.01411765        0.01529412
2 0.02909091       0.02909091       0.03272727        0.02181818
3 0.02000000       0.01500000       0.02500000        0.01000000
4 0.03937500       0.04218750       0.02531250        0.03656250
5 0.02850000       0.02550000       0.02400000        0.02250000
6 0.03473684       0.04052632       0.05210526        0.04921053
  X.Smt_Hm_Dev.Pln Fin_App_Trst   CODE Pop_Age M_Pop_Age F_Pop_Age X._Pvt_HHld
1       0.01294118   0.06588235 N2L2J2    85.0      86.1      83.5         5.0
2       0.02181818   0.09272727 L9M1Z4    37.5      35.0      67.5         2.0
3       0.02250000   0.13250000 N7G3J7    85.0      37.5      87.4         3.0
4       0.03093750   0.11531250 M9A5E4    86.6      33.2      88.6        13.0
5       0.02250000   0.07950000 M1N1M9    79.5      68.6      81.9         5.3
6       0.04342105   0.15052632 M1N4E6    68.9      65.0      70.0         3.0
  X._Fam_Hhld X._NF_Hhld Avg_Chld_PrCen_Fm_Hhld     Houses        Apt
1           1          0                    1.0 0.02941176 0.00000000
2           1          0                    1.0 0.09090909 0.00000000
3           1          0                    1.0 0.12500000 0.00000000
4           1          0                    0.0 0.00000000 0.03125000
5           2          1                    2.0 0.03750000 0.00000000
6          14          6                    1.4 0.07894737 0.05263158
      Condos Hhld.Income High.School Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach
1 0.00000000   1528.1129  0.02941176            0.02941176       0.02941176
2 0.00000000  28276.7809  0.09090909            0.00000000       0.09090909
3 0.00000000  10098.6787  0.12500000            0.00000000       0.00000000
4 0.03125000   2608.7778  0.09375000            0.06250000       0.12500000
5 0.01250000    359.3750  0.05000000            0.03750000       0.06250000
6 0.03289474    318.8611  0.07894737            0.07894737       0.11184210
        Empl       UnEmp    Tot_Mig
1 0.08823529 0.029411765 0.02941176
2 0.18181818 0.000000000 0.00000000
3 0.25000000 0.000000000 0.00000000
4 0.09375000 0.031250000 0.21875000
5 0.11250000 0.000000000 0.11250000
6 0.11184210 0.006578947 0.22368421

[75869 rows x 34 columns] 

6.7 Level 1: Random Forest

The first predictive model uses a random forest, a flexible and widely valued method that partitions the data through recursive tree splitting. Instead of building just one tree, the model below builds up to 200 trees.

fs4C.class_rf <- h2o.randomForest(         
  training_frame   = fs.split[[1]],       
  validation_frame = fs.split[[2]],        ## the H2O frame for validation (not required)
  x=c(1,3:16),                       ## the training sample predictor columns, by column index
  y=2,                          ## the target index (what we are predicting)
  model_id = "Random_Forest",    ## name the model in H2O
  ##   not required, but helps use Flow
  ntrees = 200,               
  stopping_rounds = 2,          
  score_each_iteration = T,      
  seed = 1000000)  

6.7.1 Confusion Matrices for RF

Training sample

While this might be interesting, it is best not to place too much credence on the hit-ratio for the training sample since this sample is used to build the model.

h2o.confusionMatrix(fs4C.class_rf,  fs.split[[1]]   )   
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error          Rate
Seg1   29081     1     0     3 0.0001 =  4 / 29,085
Seg2       0 53576     0     3 0.0001 =  3 / 53,579
Seg3       0     0 43924     0 0.0000 =  0 / 43,924
Seg4       1     1     0 50267 0.0000 =  2 / 50,269
Totals 29082 53578 43924 50273 0.0001 = 9 / 176,857

Testing Sample

Successful prediction is clearly showcased in the confusion matrix below. Using the RF model, 4,006 out of 75,869 people were incorrectly assigned to their segment, resulting in an error rate of 5.28%, which is very good. This means that the hit ratio was roughly 94.72%, which is very credible.

h2o.confusionMatrix(fs4C.class_rf,  fs.split[[2]]   )   
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error             Rate
Seg1   11475   404   291   314 0.0808 = 1,009 / 12,484
Seg2     326 22043   103   723 0.0497 = 1,152 / 23,195
Seg3     225    54 18118   274 0.0296 =   553 / 18,671
Seg4     200   642   450 20227 0.0600 = 1,292 / 21,519
Totals 12226 23143 18962 21538 0.0528 = 4,006 / 75,869
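As a sanity check, the overall validation error can be recomputed from the per-segment misclassification counts in the matrix above (values transcribed from the output; a base-R sketch, not part of the pipeline):

```r
# Overall validation error for the random forest, recomputed from the
# per-segment misclassification counts printed in the confusion matrix.
misclassified <- c(Seg1 = 1009, Seg2 = 1152, Seg3 = 553, Seg4 = 1292)
totals        <- c(Seg1 = 12484, Seg2 = 23195, Seg3 = 18671, Seg4 = 21519)
error_rate <- sum(misclassified) / sum(totals)
round(error_rate, 4)  # 0.0528, i.e., a hit ratio of about 94.7%
```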

6.7.2 Hit Ratios for RF

As briefly mentioned above, the hit ratio is roughly 94.72%, which is very good. This alone is not enough to conclude that the model is good, so below we have created a receptacle of metrics to explore further for all the models.

h2o.hit_ratio_table(fs4C.class_rf, train=TRUE, valid=TRUE)   
$train
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.935959
2 2  0.992796
3 3  0.997614
4 4  1.000000

$valid
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.947198
2 2  0.995703
3 3  0.999407
4 4  1.000000

6.7.3 Creating an object to hold important diagnostic information

A receptacle for diagnostic statistics is created below so that this information can be compared efficiently at the end of the predictive modeling. We will be comparing the hit ratios, MSEs, RMSEs, loglosses, and mean-per-class-errors.

modH <- data.frame(Prediction_model=character(),
                  hit_ratio=numeric(),
                  MSE=numeric(),
                  RMSE=numeric(),
                  logloss=numeric(),
                  mean_per_class_error=numeric(),
                  stringsAsFactors=FALSE) 
modH[1, 1] <- "Random_forest"
modH[1, 2] <- fs4C.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
modH[1, 3] <- fs4C.class_rf@model$validation_metrics@metrics$MSE   #  
modH[1, 4] <- fs4C.class_rf@model$validation_metrics@metrics$RMSE       #  
modH[1, 5] <- fs4C.class_rf@model$validation_metrics@metrics$ logloss
modH[1, 6] <- fs4C.class_rf@model$validation_metrics@metrics$ mean_per_class_error  

6.7.4 Plotting variable importances

Here we plot the variable importances extracted from the H2O random forest model using the plotly package. The top five variables by relative importance are:

  1. Dollar value of annual premiums for home insurance (by a long shot)
  2. Number of people who own smart thermostats
  3. Number of people having trust in financial/banking app
  4. Number of people who purchased smart phones
  5. Number of people who plan on purchasing smart home technology
rf_variable_importances <- as.data.frame(fs4C.class_rf@model$variable_importances)
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
library(plotly)
plot_ly(rf_variable_importances, 
        #        x = rf_variable_importances$percentage, 
        y=reorder(rf_variable_importances$variable,
                  rf_variable_importances$percentage),
        x = rf_variable_importances$percentage,
        color = rf_variable_importances$variable,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance for the random forest model",
          xaxis = list(title = "Percentage Importance"),
          ylim=c(0,1),
          margin = list(l = 120))  

6.8 Level 1: Logistic Regression

Here we fit a multinomial logistic regression (GLM) model for the 4-segment solution.

fs4C.class_glm <- h2o.glm(
  family= "multinomial",  
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=c(1,3:16),                        ## the predictor columns, by column index
  y=2,
  lambda=0
)  

6.8.1 Confusion Matrices for GLM

Training sample

As with the random forest, here we examine the confusion matrix for all segments within the training sample.

# str(fs8C.class_glm)
fs4C.class_glm@ model$ training_metrics@ metrics$ cm$ table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error            Rate
Seg1   29038    22     4    21 0.0016 =   47 / 29,085
Seg2      15 53532     1    31 0.0009 =   47 / 53,579
Seg3      22     2 43875    25 0.0011 =   49 / 43,924
Seg4       6    25    20 50218 0.0010 =   51 / 50,269
Totals 29081 53581 43900 50295 0.0011 = 194 / 176,857
# h2o.confusionMatrix( fs4C.class_glm, train = TRUE)

Testing sample

Successful prediction is clearly showcased in the confusion matrix below. Using the GLM model, 115 out of 75,869 people were incorrectly assigned to their segment, resulting in an error rate of 0.15%, which is even better than the random forest. This means that the hit ratio was roughly 99.85%, making the GLM even more credible than the random forest.

# str(fs4C.class_glm)
fs4C.class_glm@ model$ validation_metrics@ metrics$ cm$ table 
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error           Rate
Seg1   12450    16     1    17 0.0027 =  34 / 12,484
Seg2       8 23162     1    24 0.0014 =  33 / 23,195
Seg3       7     0 18654    10 0.0009 =  17 / 18,671
Seg4       5    15    11 21488 0.0014 =  31 / 21,519
Totals 12470 23193 18667 21539 0.0015 = 115 / 75,869
# h2o.confusionMatrix(fs4C.class_glm, valid = TRUE) # another way to see confusion matrix

6.8.2 Plotting variable importance

Looking at the variable importances, the top five variables by relative importance are:

  1. Dollar value of annual premiums for home insurance (by a long shot)
  2. Number of people who own smart thermostats
  3. Number of people who purchased smart phones
  4. Number of people having trust in financial/banking app
  5. Number of people who plan on purchasing smart home technology
h2o.varimp(fs4C.class_glm)
            variable relative_importance scaled_importance percentage
1   Ann_Prem_Hm_Insu           46.408944        1.00000000 0.19169294
2        X.Smt_Therm           33.312419        0.71780169 0.13759752
3     X..Smart_purch           23.821684        0.51329943 0.09839588
4       Fin_App_Trst           23.526912        0.50694780 0.09717832
5        Tech.Enthu.           20.993106        0.45235043 0.08671239
6       X.Smt_Hm_Ass           17.327060        0.37335604 0.07156972
7         Confidence           15.051687        0.32432729 0.06217125
8   X.Smt_Hm_Sec.Pln           11.359219        0.24476356 0.04691945
9         Smt_Hm_Dev           11.028004        0.23762669 0.04555136
10     X.Smt_Hm_Secu           10.383195        0.22373263 0.04288797
11  X.Smt_Hm_Dev.Pln            7.100212        0.15299231 0.02932755
12  Mob_Inf_sec_cons            7.016669        0.15119217 0.02898247
13  X.Smat_Therm.Pln            5.232745        0.11275295 0.02161394
14 X.Smt_Hm_Asst.Pln            4.952652        0.10671763 0.02045702
15     Mob_Mark_cons            4.585922        0.09881548 0.01894223
#glm_variable_importances <- as.data.frame(fs4C.class_glm@model$variable_importances)
glm_variable_importances <- as.data.frame(h2o.varimp(fs4C.class_glm))
## rf_variable_importances
##install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(glm_variable_importances,
        #        x = rf_variable_importances$percentage,
        y=reorder(glm_variable_importances$variable,
                 glm_variable_importances$percentage),
        x = glm_variable_importances$percentage,
        color = glm_variable_importances$variable,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance for the logistic regression model on 4 segments",
          xaxis = list(title = "Percentage Importance"),
          ylim=c(0,1),
          margin = list(l = 120))
modH[2, 1] <- "GLM_log_regr"
modH[2, 2] <- fs4C.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
modH[2, 3] <- fs4C.class_glm@model$validation_metrics@metrics$MSE   #  
modH[2, 4] <- fs4C.class_glm@model$validation_metrics@metrics$RMSE       #  
modH[2, 5] <- fs4C.class_glm@model$validation_metrics@metrics$ logloss
modH[2, 6] <- fs4C.class_glm@model$validation_metrics@metrics$ mean_per_class_error 

6.9 Level 1: Gradient Boosting Machine

The gradient boosting machine (GBM) builds an ensemble of shallow decision trees sequentially, with each new tree fitted to the residual errors of the trees built so far; it can be very effective. Hence we explore this model as well to check whether it provides better results.
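The boosting idea can be sketched in plain base R, independent of h2o: each weak learner (here a hand-rolled one-split “stump”) is fitted to the residuals of the current ensemble, and its predictions are added back with a shrinkage factor. This is an illustrative sketch under simplified assumptions, not the h2o implementation:

```r
# Minimal boosting illustration: stumps fitted to residuals, accumulated
# with shrinkage. Toy regression data, not the segmentation data set.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
pred <- rep(mean(y), 200)          # start from the overall mean
shrinkage <- 0.3
stump <- function(x, r) {
  # best single-split "tree": a constant fit on each side of the split
  splits <- quantile(x, seq(0.1, 0.9, 0.1))
  best_fit <- rep(mean(r), length(r)); best_sse <- sum((r - best_fit)^2)
  for (s in splits) {
    left <- x <= s
    fit  <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - fit)^2)
    if (sse < best_sse) { best_sse <- sse; best_fit <- fit }
  }
  best_fit
}
mse_start <- mean((y - pred)^2)
for (i in 1:50) {
  r    <- y - pred                       # residuals of the current ensemble
  pred <- pred + shrinkage * stump(x, r) # each new stump fits the residuals
}
mse_end <- mean((y - pred)^2)
c(start = mse_start, end = mse_end)      # training MSE falls as stumps are added
```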

#GBM Gradient Boosting Machine
fs4C.class_gbm<- h2o.gbm(
  distribution="AUTO",
  training_frame   = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=c(1,3:16),                        ## the predictor columns, by column index
  y=2,
  model_id = "fs.gbm",
  stopping_rounds = 3,
  histogram_type = "UniformAdaptive" ,
  stopping_tolerance = 1e-2,
  seed = 1234
)

6.9.1 Model Performance

perf <- h2o.performance(fs4C.class_gbm, fs.split[[2]])
perf
H2OMultinomialMetrics: gbm

Test Set Metrics: 
=====================

MSE: (Extract with `h2o.mse`) 0.05102248
RMSE: (Extract with `h2o.rmse`) 0.2258816
Logloss: (Extract with `h2o.logloss`) 0.1765651
Mean Per-Class Error: 0.05254961
R^2: (Extract with `h2o.r2`) 0.954615
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error             Rate
Seg1   11603   335   233   313 0.0706 =   881 / 12,484
Seg2     243 22154    66   732 0.0449 = 1,041 / 23,195
Seg3     347    99 17931   294 0.0396 =   740 / 18,671
Seg4     178   595   413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.949281
2 2  0.995611
3 3  0.999697
4 4  1.000000

6.9.2 Confusion Matrices

# str(fs4C.class_glm)
fs4C.class_gbm@ model$ training_metrics@ metrics$ cm$ table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error              Rate
Seg1   27376   630   430   649 0.0588 =  1,709 / 29,085
Seg2     453 51684   101  1341 0.0354 =  1,895 / 53,579
Seg3     587   167 42569   601 0.0308 =  1,355 / 43,924
Seg4     312  1114   863 47980 0.0455 =  2,289 / 50,269
Totals 28728 53595 43963 50571 0.0410 = 7,248 / 176,857
# h2o.confusionMatrix( fs4C.class_glm,train = TRUE)

Testing Sample

The confusion matrix for the testing sample, computed two equivalent ways, is shown below.

h2o.confusionMatrix(fs4C.class_gbm, fs.split[[2]]) # CONFUSION TABLE FOR HOLDOUT
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error             Rate
Seg1   11603   335   233   313 0.0706 =   881 / 12,484
Seg2     243 22154    66   732 0.0449 = 1,041 / 23,195
Seg3     347    99 17931   294 0.0396 =   740 / 18,671
Seg4     178   595   413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869
h2o.confusionMatrix(perf) # DIFFERENT WAY
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error             Rate
Seg1   11603   335   233   313 0.0706 =   881 / 12,484
Seg2     243 22154    66   732 0.0449 = 1,041 / 23,195
Seg3     347    99 17931   294 0.0396 =   740 / 18,671
Seg4     178   595   413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869

Validation Sample

Using the GBM model, 3,848 out of 75,869 people were incorrectly assigned to their segment, resulting in an error rate of 5.07%, which is slightly better than that of the Random Forest model, but not as low as the GLM model. This means that the hit ratio was roughly 94.93%, which is still a great hit ratio. All the models have strong hit ratios (all are roughly 94% or above).

# str(fs6C.class_gbm)
# fs6C.class_glm@ model$ validation_metrics@ metrics$ cm$ table
h2o.confusionMatrix(fs4C.class_gbm,valid = TRUE)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error             Rate
Seg1   11603   335   233   313 0.0706 =   881 / 12,484
Seg2     243 22154    66   732 0.0449 = 1,041 / 23,195
Seg3     347    99 17931   294 0.0396 =   740 / 18,671
Seg4     178   595   413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869

6.9.3 Plotting variable importance

h2o.varimp(fs4C.class_gbm)
Variable Importances: 
            variable relative_importance scaled_importance percentage
1        X.Smt_Therm       107209.156250          1.000000   0.214047
2   Ann_Prem_Hm_Insu       102356.835938          0.954740   0.204359
3       Fin_App_Trst        70859.429688          0.660946   0.141474
4     X..Smart_purch        61462.792969          0.573298   0.122713
5   X.Smt_Hm_Sec.Pln        33581.839844          0.313237   0.067047
6        Tech.Enthu.        29895.560547          0.278853   0.059688
7       X.Smt_Hm_Ass        24125.326172          0.225030   0.048167
8      X.Smt_Hm_Secu        20156.285156          0.188009   0.040243
9         Smt_Hm_Dev        15659.624023          0.146066   0.031265
10        Confidence        14745.135742          0.137536   0.029439
11  Mob_Inf_sec_cons         8862.408203          0.082665   0.017694
12  X.Smt_Hm_Dev.Pln         8198.396484          0.076471   0.016368
13  X.Smat_Therm.Pln         1741.412231          0.016243   0.003477
14 X.Smt_Hm_Asst.Pln         1040.876465          0.009709   0.002078
15     Mob_Mark_cons          971.792908          0.009064   0.001940

Looking at the variable importances, the top five variables by relative importance are:

  1. Number of people who own smart thermostats
  2. Dollar value of annual premiums for home insurance
  3. Number of people who trust in financial banking applications
  4. Number of people who purchased smart phones
  5. Number of people planning to purchase smart home security technology
# glm_variable_importances <- as.data.frame(fs6C.class_glm@model$variable_importances)
gbm_variable_importances <- as.data.frame(h2o.varimp(fs4C.class_gbm))
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(gbm_variable_importances,
        #        x = rf_variable_importances$percentage,
        y=reorder(gbm_variable_importances$variable,
                  gbm_variable_importances$percentage),
        x = gbm_variable_importances$percentage,
        color = gbm_variable_importances$variable,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance for GBM model of 4-segment solution",
          xaxis = list(title = "Percentage Importance"),
          ylim=c(0,1),
          margin = list(l = 120))

6.10 Level 1: Comparing All of the Models

The GLM logistic regression model performed the best across all metrics, and is thus the best predictor. It had the:

  • Highest hit ratio
  • Lowest MSE
  • Lowest RMSE
  • Lowest log loss
  • Lowest mean-per-class-error
modH[3, 1] <- "Gradient Boosting"
modH[3, 2] <- fs4C.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
modH[3, 3] <- fs4C.class_gbm@model$validation_metrics@metrics$MSE   #  
modH[3, 4] <- fs4C.class_gbm@model$validation_metrics@metrics$RMSE       #  
modH[3, 5] <- fs4C.class_gbm@model$validation_metrics@metrics$ logloss
modH[3, 6] <- fs4C.class_gbm@model$validation_metrics@metrics$ mean_per_class_error 
modH %>% tab_df( show.rownames = TRUE, sort.column = -2, title = "Statistics of predictive models for the 4-segment solution")
Statistics of predictive models for the 4-segment solution
Row Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
2 GLM_log_regr 1.00 0.01 0.07 0.02 0.00
3 Gradient Boosting 0.95 0.05 0.23 0.18 0.05
1 Random_forest 0.95 0.06 0.24 0.21 0.06
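For comparison, the hit ratios in this table can be recovered directly from each model’s validation confusion matrix; the misclassification counts below are transcribed from the outputs shown earlier (a base-R check, not part of the pipeline):

```r
# Hit ratios implied by the validation confusion matrices shown earlier
# (misclassification counts transcribed from those outputs).
val_errors <- c(Random_forest = 4006, GLM_log_regr = 115, Gradient_Boosting = 3848)
val_total  <- 75869
round(1 - val_errors / val_total, 4)
# Random_forest: 0.9472, GLM_log_regr: 0.9985, Gradient_Boosting: 0.9493
```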

This plot simply compares the hit ratios across all models. As mentioned above, the GLM (logistic regression) has the highest hit ratio, 0.9984842. This, along with the other metrics mentioned above, suggests that the GLM is the superior model.

modH_hit <- modH[, 1:2]
modH_hit_long <- melt(modH_hit)
# Using Prediction_model as id variables
ggplot(data=modH_hit_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
   geom_bar(  stat="identity" , fill = "lightblue", width = 0.3)+
   geom_point(aes( color= Prediction_model ), size=6) +
   theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
   labs( title= "Hit ratios for predictive models", y="Proportion correct",
         x= "Prediction Model")

6.11 Level 2 Segmentation

Level 2 of the segmentation process focuses on the very practical problem of attempting to assign people to segments based on variables that may be more accessible than the attitudinal variables, i.e., demographic variables. This second level of segmentation will use only the demographic variables which include the following:

  • Total Population Median Age
  • Male Population Median Age
  • Female Population Median Age
  • Average Number Of Persons In Private Households
  • Total Family Households
  • Non-Family Households
  • Average Children Per Census Family Household
  • Houses
  • Apartment, Building Low And High Rise
  • Condos
  • Median Household Income
  • High School Certificate Or Equivalent
  • College, CEGEP Or Other Non-University Certificate Or Diploma
  • University Certificate Or Diploma Below Bachelor
  • Employed
  • Unemployed
  • Total Immigrant
fs4C.class <- h2o.importFile("C:/Jaspreet Marketing Analytics Project/FS_4.csv")

Labeled buckets for each of the demographic variables have been prepared in the chunk below for interpretation purposes.

fs4C.class$Pop_Age_T<- cut(fs4C.class$Pop_Age, c(-1,25,50,75,100), labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
fs4C.class$M_Pop_Age_T<- cut(fs4C.class$M_Pop_Age, c(-1,25,50,75,100), labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
fs4C.class$F_Pop_Age_T<- cut(fs4C.class$F_Pop_Age, c(-1,25,50,75,100), labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
fs4C.class$X._Pvt_HHld_T<- cut(fs4C.class$X._Pvt_HHld, c(-1,2,4,6,8,15), labels = c("0-2 Persons Private Households", "3-4 Persons Private Households", "5-6 Persons Private Households", "7-8 Persons Private Households", "8+ Persons Private Households"))
fs4C.class$X._Fam_Hhld_T<- cut(fs4C.class$X._Fam_Hhld, c(-1,5,10,15,20,25,30,35,40,10000), labels = c("0-5 Family Households", "6-10 Family Households", "11-15 Family Households", "16-20 Family Households", "21-25 Family Households", "26-30 Family Households", "31-35 Family Households", "36-40 Family Households", "40+ Family Households"))

fs4C.class$X._NF_Hhld_T<- cut(fs4C.class$X._NF_Hhld, c(-1,1,3,5,7,1500), labels = c("0-1 Non-Family Households", "2-3 Non-Family Households", "4-5 Non-Family Households", "6-7 Non-Family Households", "9+ Non-Family Households"))

fs4C.class$Avg_Chld_PrCen_Fm_Hhld_T<- cut(fs4C.class$Avg_Chld_PrCen_Fm_Hhld, c(-1,1,2,3,4,14), labels = c("0-1 Children Per Household", "2 Children Per Household", "3 Children Per Household", "4 Children Per Household", "5 or More Children Per Household"))

#Houses
fs4C.class$Houses_T<- cut(fs4C.class$Houses, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% Living in Houses", 
                                     "26-50% Living in Houses", 
                                     "51-75%  Living in Houses", 
                                     "More than 76%  Living in Houses"))

#Apartments
fs4C.class$Apt_T<- cut(fs4C.class$Apt, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% Living in Apartments", 
                                     "26-50% Living in Apartments", 
                                     "51-75% Living in Apartments", 
                                     "More than 76% Living in Apartments"))

#Condos
fs4C.class$Condos_T<- cut(fs4C.class$Condos, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% Living in Condos", 
                                     "26-50% Living in Condos", 
                                     "51-75% Living in Condos", 
                                     "More than 76% Living in Condos"))

#Household Income
fs4C.class$Hhld.Income_T<- cut(fs4C.class$Hhld.Income, c(-1,20000,40000,60000,80000,100000,125000,150000,200000,300000,1600000), 
                               labels = c("Household income $0 to $19,999", 
                                          "Household income $20,000 to $39,999", 
                                          "Household income $40,000 to $59,999",
                                          "Household income $60,000 to $79,999",
                                          "Household income $80,000 to $99,999",
                                          "Household income $100,000 to $124,999",
                                          "Household income $125,000 to $149,999",
                                          "Household income $150,000 to $199,999",
                                          "Household income $200,000 to $299,999",
                                          "Household income $300,000 or over"))

#High School
fs4C.class$High.School_T<- cut(fs4C.class$High.School, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% completed High School", 
                                     "26-50% completed High School", 
                                     "51-75%  completed High School", 
                                     "More than 76%  completed High School"))

#Non-University Diploma
fs4C.class$Clg_CEGEP_Non_Uni_Dip_T<- cut(fs4C.class$Clg_CEGEP_Non_Uni_Dip, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% completed Non-University Diploma", 
                                     "26-50% completed Non-University Diploma", 
                                     "51-75%  completed Non-University Diploma", 
                                     "More than 76%  completed Non-University Diploma"))

#University Degree
fs4C.class$Uni_Dip_Blw_Bach_T<- cut(fs4C.class$Uni_Dip_Blw_Bach, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% completed University Degree", 
                                     "26-50% completed University Degree", 
                                     "51-75%  completed University Degree", 
                                     "More than 76%  completed University Degree"))

#Employed
fs4C.class$Empl_T<- cut(fs4C.class$Empl, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% Employed", 
                                     "26-50% Employed", 
                                     "51-75%  Employed", 
                                     "More than 76%  Employed"))

#Unemployed
fs4C.class$UnEmp_T<- cut(fs4C.class$UnEmp, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% Unemployed", 
                                     "26-50% Unemployed", 
                                     "51-75%  Unemployed", 
                                     "More than 76% Unemployed"))

#Total Immigrants
fs4C.class$Tot_Mig_T<- cut(fs4C.class$Tot_Mig, c(-1,0.25,0.50,0.75,1.00), 
                          labels = c("0-25% Immigrants", 
                                     "26-50% Immigrants", 
                                     "51-75% Immigrants", 
                                     "More than 76%  Immigrants"))
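As a quick sanity check on the bucketing logic used throughout this chunk: cut() assigns each value to the half-open interval (lower, upper], so a break vector of c(-1, 25, 50, 75, 100) puts 25 in the first bucket and 26 in the second. A minimal standalone sketch with made-up ages:

```r
# Standalone check of cut() bucketing; the ages here are made up
ages <- c(0, 25, 26, 50, 51, 75, 76, 100)
buckets <- cut(ages, c(-1, 25, 50, 75, 100),
               labels = c("0-25 Years", "26-50 Years",
                          "51-75 Years", "More than 76 Years"))
# Each boundary value falls into the lower bucket (intervals are (lower, upper])
as.character(buckets)
```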

Once the demographic variables have been recoded, we drop the original numeric versions and keep only the recoded demographic variables from the dataset for further analysis.

fs4C.class<-fs4C.class[-c(18:34)]
head(fs4C.class)
   Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.009426962    Seg2 0.009739573      0.004403856   0.005647059
2 0.025949382    Seg2 0.014750739      0.005865521   0.012673267
3 0.025238024    Seg2 0.029873820      0.020744546   0.015555556
4 0.031883370    Seg2 0.024817610      0.018098652   0.011764706
5 0.030433902    Seg2 0.028479670      0.009048419   0.018765432
6 0.056125319    Seg2 0.061331941      0.063795120   0.030182927
  Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1      0.007529412    0.007152941 0.009788235   0.004517647   0.01204706
2      0.021386139    0.014257426 0.022178218   0.007128713   0.02534653
3      0.025555556    0.018888889 0.028888889   0.012222222   0.03222222
4      0.027058824    0.021176471 0.028235294   0.009411765   0.03529412
5      0.027654321    0.024691358 0.023703704   0.011851852   0.02666667
6      0.028170732    0.024146341 0.042256098   0.012073171   0.03823171
   Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.007152941      0.004517647      0.007905882       0.002635294
2 0.011881188      0.011089109      0.012673267       0.005544554
3 0.018888889      0.011111111      0.018888889       0.011111111
4 0.020000000      0.016470588      0.014117647       0.015294118
5 0.018765432      0.017777778      0.021728395       0.012839506
6 0.024146341      0.032195122      0.028170732       0.034207317
  X.Smt_Hm_Dev.Pln Fin_App_Trst   CODE          Pop_Age_T        M_Pop_Age_T
1      0.004141176   0.01882353 K7L5R2        26-50 Years        26-50 Years
2      0.011089109   0.04514852 K9V6H4        26-50 Years        26-50 Years
3      0.008888889   0.05666667 N2L2G4 More than 76 Years More than 76 Years
4      0.012941176   0.06588235 N2L2J2 More than 76 Years More than 76 Years
5      0.015802469   0.05827160 M6B2P9 More than 76 Years More than 76 Years
6      0.030182927   0.09054878 N7L5H5 More than 76 Years        51-75 Years
         F_Pop_Age_T                  X._Pvt_HHld_T         X._Fam_Hhld_T
1         0-25 Years 5-6 Persons Private Households 0-5 Family Households
2        26-50 Years  8+ Persons Private Households 0-5 Family Households
3 More than 76 Years 7-8 Persons Private Households 0-5 Family Households
4 More than 76 Years 5-6 Persons Private Households 0-5 Family Households
5 More than 76 Years  8+ Persons Private Households 0-5 Family Households
6 More than 76 Years 3-4 Persons Private Households 0-5 Family Households
               X._NF_Hhld_T   Avg_Chld_PrCen_Fm_Hhld_T               Houses_T
1 0-1 Non-Family Households   3 Children Per Household 0-25% Living in Houses
2 0-1 Non-Family Households   3 Children Per Household 0-25% Living in Houses
3 0-1 Non-Family Households   2 Children Per Household 0-25% Living in Houses
4 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
5 0-1 Non-Family Households   3 Children Per Household 0-25% Living in Houses
6  9+ Non-Family Households   3 Children Per Household 0-25% Living in Houses
                       Apt_T               Condos_T
1 0-25% Living in Apartments 0-25% Living in Condos
2 0-25% Living in Apartments 0-25% Living in Condos
3 0-25% Living in Apartments 0-25% Living in Condos
4 0-25% Living in Apartments 0-25% Living in Condos
5 0-25% Living in Apartments 0-25% Living in Condos
6 0-25% Living in Apartments 0-25% Living in Condos
                   Hhld.Income_T               High.School_T
1 Household income $0 to $19,999 0-25% completed High School
2 Household income $0 to $19,999 0-25% completed High School
3 Household income $0 to $19,999 0-25% completed High School
4 Household income $0 to $19,999 0-25% completed High School
5 Household income $0 to $19,999 0-25% completed High School
6 Household income $0 to $19,999 0-25% completed High School
                 Clg_CEGEP_Non_Uni_Dip_T                Uni_Dip_Blw_Bach_T
1 0-25% completed Non-University Diploma 0-25% completed University Degree
2 0-25% completed Non-University Diploma 0-25% completed University Degree
3 0-25% completed Non-University Diploma 0-25% completed University Degree
4 0-25% completed Non-University Diploma 0-25% completed University Degree
5 0-25% completed Non-University Diploma 0-25% completed University Degree
6 0-25% completed Non-University Diploma 0-25% completed University Degree
          Empl_T          UnEmp_T         Tot_Mig_T
1 0-25% Employed 0-25% Unemployed  0-25% Immigrants
2 0-25% Employed 0-25% Unemployed  0-25% Immigrants
3 0-25% Employed 0-25% Unemployed  0-25% Immigrants
4 0-25% Employed 0-25% Unemployed  0-25% Immigrants
5 0-25% Employed 0-25% Unemployed  0-25% Immigrants
6 0-25% Employed 0-25% Unemployed 26-50% Immigrants

Here we rename the recoded variables back to the original names considered initially.

library(sjmisc)

names(fs4C.class)[names(fs4C.class) == "Pop_Age_T"] <- "Pop_Age"
names(fs4C.class)[names(fs4C.class) == "M_Pop_Age_T"] <- "M_Pop_Age"
names(fs4C.class)[names(fs4C.class) == "F_Pop_Age_T"] <- "F_Pop_Age"
names(fs4C.class)[names(fs4C.class) == "X._Pvt_HHld_T"] <- "Pvt_HHld"
names(fs4C.class)[names(fs4C.class) == "X._Fam_Hhld_T"] <- "Fam_Hhld"
names(fs4C.class)[names(fs4C.class) == "X._NF_Hhld_T"] <- "NF_Hhld"
names(fs4C.class)[names(fs4C.class) == "Avg_Chld_PrCen_Fm_Hhld_T"] <- "Avg_Chld_PrCen_Fm_Hhld"
names(fs4C.class)[names(fs4C.class) == "Houses_T"] <- "Houses"
names(fs4C.class)[names(fs4C.class) == "Apt_T"] <- "Apt"
names(fs4C.class)[names(fs4C.class) == "Condos_T"] <- "Condos"
names(fs4C.class)[names(fs4C.class) == "Hhld.Income_T"] <- "Hhld.Income"
names(fs4C.class)[names(fs4C.class) == "High.School_T"] <- "High.School"
names(fs4C.class)[names(fs4C.class) == "Clg_CEGEP_Non_Uni_Dip_T"] <- "Clg_CEGEP_Non_Uni_Dip"
names(fs4C.class)[names(fs4C.class) == "Uni_Dip_Blw_Bach_T"] <- "Uni_Dip_Blw_Bach"
names(fs4C.class)[names(fs4C.class) == "Empl_T"] <- "Empl"
names(fs4C.class)[names(fs4C.class) == "UnEmp_T"] <- "UnEmp"
names(fs4C.class)[names(fs4C.class) == "Tot_Mig_T"] <- "Tot_Mig"

colnames(fs4C.class)
 [1] "Confidence"             "predict"                "Tech.Enthu."           
 [4] "Ann_Prem_Hm_Insu"       "Mob_Mark_cons"          "Mob_Inf_sec_cons"      
 [7] "X..Smart_purch"         "X.Smt_Therm"            "X.Smt_Hm_Secu"         
[10] "X.Smt_Hm_Ass"           "Smt_Hm_Dev"             "X.Smat_Therm.Pln"      
[13] "X.Smt_Hm_Sec.Pln"       "X.Smt_Hm_Asst.Pln"      "X.Smt_Hm_Dev.Pln"      
[16] "Fin_App_Trst"           "CODE"                   "Pop_Age"               
[19] "M_Pop_Age"              "F_Pop_Age"              "Pvt_HHld"              
[22] "Fam_Hhld"               "NF_Hhld"                "Avg_Chld_PrCen_Fm_Hhld"
[25] "Houses"                 "Apt"                    "Condos"                
[28] "Hhld.Income"            "High.School"            "Clg_CEGEP_Non_Uni_Dip" 
[31] "Uni_Dip_Blw_Bach"       "Empl"                   "UnEmp"                 
[34] "Tot_Mig"               

6.11.1 Variable Exploration

fs2<-as.data.frame(fs4C.class)

The following bar charts simply show the number of people or number of households in each category.

library(ggplot2)
ggplot(data.frame(fs2), aes(x=Pop_Age)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=M_Pop_Age)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=F_Pop_Age)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Pvt_HHld)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Fam_Hhld)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=NF_Hhld)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Avg_Chld_PrCen_Fm_Hhld)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Houses)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Apt)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Condos)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Hhld.Income)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=High.School)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Clg_CEGEP_Non_Uni_Dip)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Uni_Dip_Blw_Bach)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Empl)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=UnEmp)) +
  geom_bar()

ggplot(data.frame(fs2), aes(x=Tot_Mig)) +
  geom_bar()

6.11.2 Splitting the sample for cross-validation

We start by splitting the dataset into training (70%) and testing (validation) sets.

#__splitting df into training (70%) and testing (validation) datasets________
fs.split<- h2o.splitFrame(fs4C.class, ratios=c(0.7))
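Note that h2o.splitFrame performs approximate random splitting, so the exact split can vary between runs; its optional seed argument makes the split reproducible. A sketch (the seed value below is arbitrary):

```r
# Reproducible 70/30 split; the seed value is arbitrary
fs.split <- h2o.splitFrame(fs4C.class, ratios = c(0.7), seed = 1000000)
```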

The training sample on which the predictive model will be built is below.

fs.split[[1]]
   Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.009426962    Seg2 0.009739573      0.004403856   0.005647059
2 0.025949382    Seg2 0.014750739      0.005865521   0.012673267
3 0.025238024    Seg2 0.029873820      0.020744546   0.015555556
4 0.031883370    Seg2 0.024817610      0.018098652   0.011764706
5 0.056125319    Seg2 0.061331941      0.063795120   0.030182927
6 0.030407734    Seg2 0.031585243      0.020467166   0.011842105
  Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1      0.007529412    0.007152941 0.009788235   0.004517647   0.01204706
2      0.021386139    0.014257426 0.022178218   0.007128713   0.02534653
3      0.025555556    0.018888889 0.028888889   0.012222222   0.03222222
4      0.027058824    0.021176471 0.028235294   0.009411765   0.03529412
5      0.028170732    0.024146341 0.042256098   0.012073171   0.03823171
6      0.030263158    0.015789474 0.034210526   0.019736842   0.04473684
   Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.007152941      0.004517647      0.007905882       0.002635294
2 0.011881188      0.011089109      0.012673267       0.005544554
3 0.018888889      0.011111111      0.018888889       0.011111111
4 0.020000000      0.016470588      0.014117647       0.015294118
5 0.024146341      0.032195122      0.028170732       0.034207317
6 0.025000000      0.010526316      0.019736842       0.009210526
  X.Smt_Hm_Dev.Pln Fin_App_Trst   CODE            Pop_Age          M_Pop_Age
1      0.004141176   0.01882353 K7L5R2        26-50 Years        26-50 Years
2      0.011089109   0.04514852 K9V6H4        26-50 Years        26-50 Years
3      0.008888889   0.05666667 N2L2G4 More than 76 Years More than 76 Years
4      0.012941176   0.06588235 N2L2J2 More than 76 Years More than 76 Years
5      0.030182927   0.09054878 N7L5H5 More than 76 Years        51-75 Years
6      0.018421053   0.07236842 N3L4E3        51-75 Years        51-75 Years
           F_Pop_Age                       Pvt_HHld              Fam_Hhld
1         0-25 Years 5-6 Persons Private Households 0-5 Family Households
2        26-50 Years  8+ Persons Private Households 0-5 Family Households
3 More than 76 Years 7-8 Persons Private Households 0-5 Family Households
4 More than 76 Years 5-6 Persons Private Households 0-5 Family Households
5 More than 76 Years 3-4 Persons Private Households 0-5 Family Households
6        51-75 Years 7-8 Persons Private Households 0-5 Family Households
                    NF_Hhld     Avg_Chld_PrCen_Fm_Hhld                 Houses
1 0-1 Non-Family Households   3 Children Per Household 0-25% Living in Houses
2 0-1 Non-Family Households   3 Children Per Household 0-25% Living in Houses
3 0-1 Non-Family Households   2 Children Per Household 0-25% Living in Houses
4 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
5  9+ Non-Family Households   3 Children Per Household 0-25% Living in Houses
6 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
                         Apt                 Condos
1 0-25% Living in Apartments 0-25% Living in Condos
2 0-25% Living in Apartments 0-25% Living in Condos
3 0-25% Living in Apartments 0-25% Living in Condos
4 0-25% Living in Apartments 0-25% Living in Condos
5 0-25% Living in Apartments 0-25% Living in Condos
6 0-25% Living in Apartments 0-25% Living in Condos
                     Hhld.Income                 High.School
1 Household income $0 to $19,999 0-25% completed High School
2 Household income $0 to $19,999 0-25% completed High School
3 Household income $0 to $19,999 0-25% completed High School
4 Household income $0 to $19,999 0-25% completed High School
5 Household income $0 to $19,999 0-25% completed High School
6 Household income $0 to $19,999 0-25% completed High School
                   Clg_CEGEP_Non_Uni_Dip                  Uni_Dip_Blw_Bach
1 0-25% completed Non-University Diploma 0-25% completed University Degree
2 0-25% completed Non-University Diploma 0-25% completed University Degree
3 0-25% completed Non-University Diploma 0-25% completed University Degree
4 0-25% completed Non-University Diploma 0-25% completed University Degree
5 0-25% completed Non-University Diploma 0-25% completed University Degree
6 0-25% completed Non-University Diploma 0-25% completed University Degree
            Empl            UnEmp           Tot_Mig
1 0-25% Employed 0-25% Unemployed  0-25% Immigrants
2 0-25% Employed 0-25% Unemployed  0-25% Immigrants
3 0-25% Employed 0-25% Unemployed  0-25% Immigrants
4 0-25% Employed 0-25% Unemployed  0-25% Immigrants
5 0-25% Employed 0-25% Unemployed 26-50% Immigrants
6 0-25% Employed 0-25% Unemployed  0-25% Immigrants

[176995 rows x 34 columns] 

The holdout, or testing, sample will be used to help understand the ability of the model to predict segment membership.

fs.split[[2]]
  Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.03043390    Seg2  0.02847967      0.009048419    0.01876543
2 0.07610093    Seg2  0.03983030      0.092729259    0.02285714
3 0.17222783    Seg2  0.09188291      0.191906973    0.04990826
4 0.08864146    Seg2  0.09336437      0.015857921    0.03712871
5 0.07195438    Seg2  0.07085803      0.070776382    0.04000000
6 0.06057126    Seg2  0.07169717      0.074680364    0.03733333
  Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1       0.02765432     0.02469136  0.02370370   0.011851852   0.02666667
2       0.02857143     0.01285714  0.01857143   0.005714286   0.02000000
3       0.04366972     0.05614679  0.03119266   0.006238532   0.04990826
4       0.04455445     0.06930693  0.06188119   0.032178218   0.06188119
5       0.04500000     0.04750000  0.05000000   0.020000000   0.06000000
6       0.04666667     0.03033333  0.05366667   0.018666667   0.04200000
  Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.01876543       0.01777778       0.02172840        0.01283951
2 0.01285714       0.01142857       0.01142857        0.01285714
3 0.03119266       0.01871560       0.03119266        0.04990826
4 0.04702970       0.03465347       0.05693069        0.02722772
5 0.04500000       0.03750000       0.04500000        0.02500000
6 0.02566667       0.03733333       0.03966667        0.02566667
  X.Smt_Hm_Dev.Pln Fin_App_Trst   CODE            Pop_Age          M_Pop_Age
1       0.01580247   0.05827160 M6B2P9 More than 76 Years More than 76 Years
2       0.01000000   0.07571429 K6H6X7 More than 76 Years More than 76 Years
3       0.03743119   0.24330275 L4A0P5        51-75 Years        51-75 Years
4       0.03217822   0.13613861 M6B4E5 More than 76 Years        51-75 Years
5       0.04000000   0.13500000 L9M2B9        26-50 Years        26-50 Years
6       0.02333333   0.10266667 L9C2C7 More than 76 Years        51-75 Years
           F_Pop_Age                       Pvt_HHld                Fam_Hhld
1 More than 76 Years  8+ Persons Private Households   0-5 Family Households
2 More than 76 Years 0-2 Persons Private Households   0-5 Family Households
3        51-75 Years 0-2 Persons Private Households 21-25 Family Households
4 More than 76 Years  8+ Persons Private Households   0-5 Family Households
5        26-50 Years 3-4 Persons Private Households   0-5 Family Households
6 More than 76 Years 3-4 Persons Private Households   0-5 Family Households
                    NF_Hhld     Avg_Chld_PrCen_Fm_Hhld                 Houses
1 0-1 Non-Family Households   3 Children Per Household 0-25% Living in Houses
2 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
3  9+ Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
4 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
5 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
6 0-1 Non-Family Households   2 Children Per Household 0-25% Living in Houses
                                 Apt                 Condos
1         0-25% Living in Apartments 0-25% Living in Condos
2         0-25% Living in Apartments 0-25% Living in Condos
3 26-50% Years  Living in Apartments 0-25% Living in Condos
4         0-25% Living in Apartments 0-25% Living in Condos
5         0-25% Living in Apartments 0-25% Living in Condos
6         0-25% Living in Apartments 0-25% Living in Condos
                     Hhld.Income                 High.School
1 Household income $0 to $19,999 0-25% completed High School
2 Household income $0 to $19,999 0-25% completed High School
3 Household income $0 to $19,999 0-25% completed High School
4 Household income $0 to $19,999 0-25% completed High School
5 Household income $0 to $19,999 0-25% completed High School
6 Household income $0 to $19,999 0-25% completed High School
                   Clg_CEGEP_Non_Uni_Dip                  Uni_Dip_Blw_Bach
1 0-25% completed Non-University Diploma 0-25% completed University Degree
2 0-25% completed Non-University Diploma 0-25% completed University Degree
3 0-25% completed Non-University Diploma 0-25% completed University Degree
4 0-25% completed Non-University Diploma 0-25% completed University Degree
5 0-25% completed Non-University Diploma 0-25% completed University Degree
6 0-25% completed Non-University Diploma 0-25% completed University Degree
            Empl            UnEmp          Tot_Mig
1 0-25% Employed 0-25% Unemployed 0-25% Immigrants
2 0-25% Employed 0-25% Unemployed 0-25% Immigrants
3 0-25% Employed 0-25% Unemployed 0-25% Immigrants
4 0-25% Employed 0-25% Unemployed 0-25% Immigrants
5 0-25% Employed 0-25% Unemployed 0-25% Immigrants
6 0-25% Employed 0-25% Unemployed 0-25% Immigrants

[75731 rows x 34 columns] 

6.12 Level 2: Random forest model

Here we fit a random forest model that attempts to predict membership in the 4-segment solution developed in the Level 1 segmentation using only the demographic variables.

fs3Cov.class_rf <- h2o.randomForest(       ## h2o.randomForest function
  training_frame   = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],        ## the H2O frame for validation (not required)
  x = 18:34,                               ## the predictor columns, by column index
  y = 2,                                   ## the target index (what we are predicting)
  model_id = "RF_cov",                     ## name for the model in H2O
  ntrees = 200,                            ## maximum number of trees
  stopping_rounds = 2,                     ## stop when validation error stops improving
  score_each_iteration = TRUE,             ## score after each tree is added
  seed = 1000000)                          ## set the random seed so that this can be reproduced
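Once fitted, an overall summary of the validation metrics (MSE, logloss, per-class errors, hit ratios) can be pulled in a single call with h2o.performance before inspecting the individual confusion matrices:

```r
# Summary of metrics on the validation frame for the fitted random forest
h2o.performance(fs3Cov.class_rf, valid = TRUE)
```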

6.12.1 Confusion Matrices

Training Sample

The predictive ability of the random forest model on the training sample is quite poor. Training-sample metrics are optimistically biased, however, so they should not be relied upon; the holdout sample is the better guide.

h2o.confusionMatrix(fs3Cov.class_rf,  fs.split[[1]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error               Rate
Seg1   15287  5765  4458  3507 0.4732 =  13,730 / 29,017
Seg2    2542 36994  3719 10581 0.3128 =  16,842 / 53,836
Seg3    1524  2585 36299  3460 0.1725 =   7,569 / 43,868
Seg4    1653 11066  4100 33455 0.3345 =  16,819 / 50,274
Totals 21006 56410 48576 51003 0.3105 = 54,960 / 176,995

Testing Sample

The error rate and the corresponding hit ratio for the testing sample are very poor compared to the Level 1 segmentation using random forest. 27,559 out of 75,731 people were assigned to the wrong segment, giving an error rate of 36.39% and a hit ratio of 63.61%.

h2o.confusionMatrix(fs3Cov.class_rf,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1  Seg2  Seg3  Seg4  Error              Rate
Seg1   5636  2861  2244  1811 0.5510 =  6,916 / 12,552
Seg2   1453 14567  1832  5086 0.3649 =  8,371 / 22,938
Seg3    859  1261 14797  1810 0.2099 =  3,930 / 18,727
Seg4    922  5349  2071 13172 0.3877 =  8,342 / 21,514
Totals 8870 24038 20944 21879 0.3639 = 27,559 / 75,731

The corresponding top-1 hit ratios below are 63.66% for the training sample and 63.61% for the validation sample.

h2o.hit_ratio_table(fs3Cov.class_rf, train=TRUE, valid=TRUE )
$train
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.636606
2 2  0.878381
3 3  0.968664
4 4  1.000000

$valid
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.636094
2 2  0.878768
3 3  0.970171
4 4  1.000000

6.12.2 Variable Importances

While total immigrants, houses, university diploma below bachelor, average children per census family household, and persons per private household are the most important variables, even they do not do a very good job of predicting segment membership.

fs3Cov.class_rf@model$variable_importances
Variable Importances: 
                 variable relative_importance scaled_importance percentage
1                 Tot_Mig       212571.953125          1.000000   0.242930
2                  Houses        93486.242188          0.439786   0.106837
3        Uni_Dip_Blw_Bach        82943.546875          0.390190   0.094789
4  Avg_Chld_PrCen_Fm_Hhld        71780.531250          0.337676   0.082032
5                Pvt_HHld        66663.859375          0.313606   0.076184
6                 NF_Hhld        62827.562500          0.295559   0.071800
7                Fam_Hhld        54388.355469          0.255859   0.062156
8                    Empl        46267.671875          0.217657   0.052875
9                     Apt        44258.714844          0.208206   0.050579
10                Pop_Age        30073.251953          0.141473   0.034368
11              F_Pop_Age        25027.115234          0.117735   0.028601
12              M_Pop_Age        23832.818359          0.112116   0.027236
13            Hhld.Income        20380.693359          0.095877   0.023291
14            High.School        15530.393555          0.073059   0.017748
15  Clg_CEGEP_Non_Uni_Dip        12773.874023          0.060092   0.014598
16                 Condos        10440.803711          0.049117   0.011932
17                  UnEmp         1786.749756          0.008405   0.002042

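The same importances can be visualized with h2o.varimp_plot(), which draws a bar chart of the scaled importances:

```r
# Bar chart of the top 10 scaled variable importances for the random forest
h2o.varimp_plot(fs3Cov.class_rf, num_of_features = 10)
```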
6.12.3 The diagnostic statistics

The table below shows the hit-ratio, MSE (mean square error), RMSE (root mean square error), logloss and mean-per-class-error for the random forest model. This will be compared to the other models further along in this document.

# THE FOLLOWING covH data frame IS TO HOLD STATISTICS FROM EACH MODEL
covH <- data.frame(Prediction_model=character(),
                   hit_ratio=numeric(),
                   MSE=numeric(),
                   RMSE=numeric(),
                   logloss=numeric(),
                   mean_per_class_error=numeric(),
                   stringsAsFactors=FALSE)

covH[1, 1] <- "Random_forest"
covH[1, 2] <- fs3Cov.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
covH[1, 3] <- fs3Cov.class_rf@model$validation_metrics@metrics$MSE
covH[1, 4] <- fs3Cov.class_rf@model$validation_metrics@metrics$RMSE
covH[1, 5] <- fs3Cov.class_rf@model$validation_metrics@metrics$logloss
covH[1, 6] <- fs3Cov.class_rf@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio       MSE     RMSE   logloss mean_per_class_error
1    Random_forest 0.6360936 0.2957424 0.543822 0.8446302            0.3783833

6.13 Level 2: Logistic regression

Logistic regression using GLM is the second predictive model that will be investigated.

fs3Cov.class_glm<- h2o.glm(
  family= "multinomial",  
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=18:34,                        ## the predictor columns, by column index
  y=2,
  lambda=0
)

6.13.1 Confusion Matrices

Training Sample

The error rate on the training sample is quite poor for the logistic regression (GLM) model as well, roughly 37%.

h2o.confusionMatrix(fs3Cov.class_glm,  fs.split[[1]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error               Rate
Seg1   12908  7228  4835  4046 0.5552 =  16,109 / 29,017
Seg2    3820 33332  4145 12539 0.3809 =  20,504 / 53,836
Seg3    2244  3721 33868  4035 0.2280 =  10,000 / 43,868
Seg4    2403 12503  4536 30832 0.3867 =  19,442 / 50,274
Totals 21375 56784 47384 51452 0.3732 = 66,055 / 176,995

Testing Sample

The holdout-sample error rate, 37.52%, is nearly as poor as that of the training sample (37.32%).

h2o.confusionMatrix(fs3Cov.class_glm,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1  Seg2  Seg3  Seg4  Error              Rate
Seg1   5543  3146  2111  1752 0.5584 =  7,009 / 12,552
Seg2   1675 14121  1784  5358 0.3844 =  8,817 / 22,938
Seg3    908  1525 14483  1811 0.2266 =  4,244 / 18,727
Seg4    966  5407  1972 13169 0.3879 =  8,345 / 21,514
Totals 9092 24199 20350 22090 0.3752 = 28,415 / 75,731

Of course the hit-ratios, which are just 1 minus the error rates, are also quite poor: 62.68% for the training sample and 62.48% for the validation sample.

h2o.hit_ratio_table(fs3Cov.class_glm, train=TRUE, valid=TRUE )
$train
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.626797
2 2  0.872663
3 3  0.967745
4 4  1.000000

$valid
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.624790
2 2  0.870581
3 3  0.967266
4 4  1.000000
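Since the top-1 hit ratio is just 1 minus the overall error rate, either can be recovered from any confusion matrix. A minimal base-R sketch using made-up counts (not the H2O output above):

```r
# Hypothetical 4-class confusion matrix: rows = actual, columns = predicted
cm <- matrix(c(120,  20,  10,   5,
                15, 200,  25,  30,
                 8,  12, 180,  10,
                 6,  25,  14, 150),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("Seg", 1:4), paste0("Seg", 1:4)))

hit_ratio  <- sum(diag(cm)) / sum(cm)   # proportion predicted correctly
error_rate <- 1 - hit_ratio             # overall error rate
c(hit_ratio = hit_ratio, error_rate = error_rate)
```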

6.13.2 Diagnostic statistics for the GLM model

The table below shows the diagnostic statistics for the random forest and logistic regression models. So far, the Random Forest model looks better overall, but not good enough to present to a client.

covH[2, 1] <- "GLM"
covH[2, 2] <- fs3Cov.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
covH[2, 3] <- fs3Cov.class_glm@model$validation_metrics@metrics$MSE
covH[2, 4] <- fs3Cov.class_glm@model$validation_metrics@metrics$RMSE
covH[2, 5] <- fs3Cov.class_glm@model$validation_metrics@metrics$logloss
covH[2, 6] <- fs3Cov.class_glm@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio       MSE     RMSE   logloss mean_per_class_error
1    Random_forest 0.6360936 0.2957424 0.543822 0.8446302            0.3783833
2              GLM 0.6247904 0.3068551 0.553945 0.8733787            0.3893232

6.14 Level 2: Deep Learning Model

The basic deep learning model has a very simple structure that can be expanded.

fs3Cov.class_dl <- h2o.deeplearning(
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=18:34,                        ## the predictor columns, by column index
  y=2
)

6.14.1 Confusion Matrices

Training Sample

The training sample error rate, reported on a temporary training frame of 9,912 rows, is 36.56%; however, this matters less than performance on the testing sample.

fs3Cov.class_dl@model$training_metrics  
H2OMultinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9912 samples **

Training Set Metrics: 
=====================

MSE: (Extract with `h2o.mse`) 0.2885144
RMSE: (Extract with `h2o.rmse`) 0.5371354
Logloss: (Extract with `h2o.logloss`) 0.8378529
Mean Per-Class Error: 0.3821669
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4  Error            Rate
Seg1    650  239  385  327 0.5940 =   951 / 1,601
Seg2    200 1556  328  926 0.4831 = 1,454 / 3,010
Seg3     55   61 2091  279 0.1589 =   395 / 2,486
Seg4     74  464  286 1991 0.2927 =   824 / 2,815
Totals  979 2320 3090 3523 0.3656 = 3,624 / 9,912

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.634383
2 2  0.877018
3 3  0.970742
4 4  1.000000

Testing Sample

Unfortunately, the validation sample error rate is in the same ballpark as the other Level 2 models, at roughly 37.63%.

h2o.confusionMatrix(fs3Cov.class_dl,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1  Seg2  Seg3  Seg4  Error              Rate
Seg1   4929  1965  3173  2485 0.6073 =  7,623 / 12,552
Seg2   1309 11545  2837  7247 0.4967 = 11,393 / 22,938
Seg3    402   477 15763  2085 0.1583 =  2,964 / 18,727
Seg4    612  3581  2322 14999 0.3028 =  6,515 / 21,514
Totals 7252 17568 24095 26816 0.3763 = 28,495 / 75,731

Again, the hit ratios are very poor. The validation sample hit ratio is 62.37%.

h2o.hit_ratio_table(fs3Cov.class_dl, train=TRUE, valid=TRUE )
$train
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.634383
2 2  0.877018
3 3  0.970742
4 4  1.000000

$valid
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.623734
2 2  0.873011
3 3  0.969035
4 4  1.000000

6.14.2 Diagnostic statistics

The diagnostic statistics table indicates that the Random Forest model is still the best performer in a very dismal race.

covH[3, 1] <- "Deep_Learning"
covH[3, 2] <- fs3Cov.class_dl@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
covH[3, 3] <- fs3Cov.class_dl@model$validation_metrics@metrics$MSE
covH[3, 4] <- fs3Cov.class_dl@model$validation_metrics@metrics$RMSE
covH[3, 5] <- fs3Cov.class_dl@model$validation_metrics@metrics$logloss
covH[3, 6] <- fs3Cov.class_dl@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio       MSE      RMSE   logloss mean_per_class_error
1    Random_forest 0.6360936 0.2957424 0.5438220 0.8446302            0.3783833
2              GLM 0.6247904 0.3068551 0.5539450 0.8733787            0.3893232
3    Deep_Learning 0.6237340 0.2963624 0.5443918 0.8637924            0.3912751

6.15 Level 2: Gradient Boosting Model

The basic gradient boosting model, which builds an ensemble of decision trees, is specified below.

fs3Cov.class_gbm <- h2o.gbm(
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=18:34,                        ## the predictor columns, by column index
  y=2
)

6.15.1 Confusion Matrices

Training Sample

The high error rate shown below for the training sample is similar to that of the previous models, at 35.83%.

h2o.confusionMatrix(fs3Cov.class_gbm,  fs.split[[1]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error               Rate
Seg1   13353  6628  4933  4103 0.5398 =  15,664 / 29,017
Seg2    3384 34681  4168 11603 0.3558 =  19,155 / 53,836
Seg3    2017  3145 34653  4053 0.2101 =   9,215 / 43,868
Seg4    2237 12725  4421 30891 0.3855 =  19,383 / 50,274
Totals 20991 57179 48175 50650 0.3583 = 63,417 / 176,995

Testing Sample

However, the holdout sample error rate is 36.32%, again in the same ballpark as the previous models.

h2o.confusionMatrix(fs3Cov.class_gbm,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1  Seg2  Seg3  Seg4  Error              Rate
Seg1   5639  2931  2189  1793 0.5507 =  6,913 / 12,552
Seg2   1455 14727  1782  4974 0.3580 =  8,211 / 22,938
Seg3    833  1306 14738  1850 0.2130 =  3,989 / 18,727
Seg4    906  5561  1927 13120 0.3902 =  8,394 / 21,514
Totals 8833 24525 20636 21737 0.3632 = 27,507 / 75,731

The top-1 hit ratio for the testing sample is 63.68%.

h2o.hit_ratio_table(fs3Cov.class_gbm, train=TRUE, valid=TRUE )
$train
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.641702
2 2  0.881629
3 3  0.970705
4 4  1.000000

$valid
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.636780
2 2  0.878768
3 3  0.970501
4 4  1.000000

6.15.2 Diagnostic statistics

The GBM model performs similarly to the Random Forest across the five metrics; again, nothing to write home about.

covH[4, 1] <- "GBM_Boosting"
covH[4, 2] <- fs3Cov.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
covH[4, 3] <- fs3Cov.class_gbm@model$validation_metrics@metrics$MSE
covH[4, 4] <- fs3Cov.class_gbm@model$validation_metrics@metrics$RMSE
covH[4, 5] <- fs3Cov.class_gbm@model$validation_metrics@metrics$logloss
covH[4, 6] <- fs3Cov.class_gbm@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio       MSE      RMSE   logloss mean_per_class_error
1    Random_forest 0.6360936 0.2957424 0.5438220 0.8446302            0.3783833
2              GLM 0.6247904 0.3068551 0.5539450 0.8733787            0.3893232
3    Deep_Learning 0.6237340 0.2963624 0.5443918 0.8637924            0.3912751
4     GBM_Boosting 0.6367802 0.3010183 0.5486513 0.8505309            0.3779716

6.16 Level 2: Naive Bayes Model

Finally, we fit a Naive Bayes model.

fs3Cov.class_nB <- h2o.naiveBayes(
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=18:34,                        ## the predictor columns, by column index
  y=2,
  laplace = 3)
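The laplace = 3 argument adds a smoothing count so that a predictor level never observed within a class still receives non-zero probability, preventing a single unseen level from zeroing out that class's posterior. A small base-R sketch of the smoothed estimate for one categorical predictor (toy values, not the survey data):

```r
# Laplace-smoothed conditional probabilities for one predictor within one class
laplace  <- 3
x        <- c("a", "a", "b", "a", "b")   # toy predictor values for one class
levels_x <- c("a", "b", "c")             # level "c" is unseen in this class

counts     <- table(factor(x, levels = levels_x))
p_smoothed <- (counts + laplace) / (length(x) + laplace * length(levels_x))
p_smoothed   # "c" gets 3/14 instead of probability zero
```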

6.16.1 Confusion Matrices

Training Sample

The Naive Bayes model error rate for the training sample is the highest one we've seen thus far for training samples, at 40.50%.

h2o.confusionMatrix(fs3Cov.class_nB,  fs.split[[1]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
        Seg1  Seg2  Seg3  Seg4  Error               Rate
Seg1   11363  6562  7201  3891 0.6084 =  17,654 / 29,017
Seg2    4115 28389  7269 14063 0.4727 =  25,447 / 53,836
Seg3    1736  3057 34845  4230 0.2057 =   9,023 / 43,868
Seg4    2754 10222  6587 30711 0.3891 =  19,563 / 50,274
Totals 19968 48230 55902 52895 0.4050 = 71,687 / 176,995

Testing Sample

Similarly, the Naive Bayes model error rate for the testing sample is the highest one we've seen thus far for testing samples, at 40.64%.

h2o.confusionMatrix(fs3Cov.class_nB,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1  Seg2  Seg3  Seg4  Error              Rate
Seg1   4825  2830  3184  1713 0.6156 =  7,727 / 12,552
Seg2   1764 12082  3105  5987 0.4733 = 10,856 / 22,938
Seg3    752  1230 14855  1890 0.2068 =  3,872 / 18,727
Seg4   1126  4427  2770 13191 0.3869 =  8,323 / 21,514
Totals 8467 20569 23914 22781 0.4064 = 30,778 / 75,731

The top-1 hit ratio for the testing sample is 59.36%, which is quite poor.

h2o.hit_ratio_table(fs3Cov.class_nB, train=TRUE, valid=TRUE )
$train
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.594977
2 2  0.847035
3 3  0.959733
4 4  1.000000

$valid
Top-4 Hit Ratios: 
  k hit_ratio
1 1  0.593588
2 2  0.846021
3 3  0.958511
4 4  1.000000

6.16.2 Comparing the Five Predictive Models’ Diagnostic Statistics

covH[5, 1] <- "naive_Bayes"
covH[5, 2] <- fs3Cov.class_nB@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
covH[5, 3] <- fs3Cov.class_nB@model$validation_metrics@metrics$MSE
covH[5, 4] <- fs3Cov.class_nB@model$validation_metrics@metrics$RMSE
covH[5, 5] <- fs3Cov.class_nB@model$validation_metrics@metrics$logloss
covH[5, 6] <- fs3Cov.class_nB@model$validation_metrics@metrics$mean_per_class_error

The Random Forest model is slightly superior to the other models overall; however, its hit ratio is nothing to be proud of. The Naive Bayes model is clearly the worst.

pacman::p_load(sjPlot)
tab_df(covH, sort.column = -2, show.rownames = FALSE, digits = 3,
       title = "Statistics of five Level 2 models of the 6-segment solution")
Statistics of five Level 2 models of the 6-segment solution
Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
GBM_Boosting 0.637 0.301 0.549 0.851 0.378
Random_forest 0.636 0.296 0.544 0.845 0.378
GLM 0.625 0.307 0.554 0.873 0.389
Deep_Learning 0.624 0.296 0.544 0.864 0.391
naive_Bayes 0.594 0.328 0.572 1.024 0.421

6.16.3 Graphically Comparing The Hit Ratios

As the chart shows, the hit ratios are all quite similar, and all quite poor.

pacman::p_load(reshape2)
# covH[,1:2] 
hits_long<- melt(covH[,1:2] ) # need to reshape to 'long' form
# head(hits_long )
pacman::p_load(ggplot2)
ggplot(data=hits_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
  geom_bar(  stat="identity" , fill = "pink", width = 0.3) +
  geom_point(aes( color= Prediction_model ), size=3) +
  theme(axis.text.x = element_text(angle = 0, hjust = 1, size=10)) +
  labs( title= "Hit ratios for predictive models using covariates" , 
        y= "Proportion correct", x= "Prediction Model") +
  ylim(0, 1)

6.16.4 Obtaining the segment assignments using the random forest model

Here we predict each respondent's segment based on the fs3Cov.class_rf model (demographic variables only).

cov.assign.hex = h2o.predict(fs3Cov.class_rf, fs4C.class[,18:34]) 
#cov.assign.hex<- as.factor(cov.assign.hex) # cluster 'names' must be factors for modeling
h2o.print(cov.assign.hex$predict, n = 10L)
h2o.table(cov.assign.hex$predict) # table of assignments over all respondents

6.17 Combining L1 segments and L2 segments

The segments developed from the 4-segment k-means solution and those predicted based on the demographic variables are combined into a single dataframe below and then saved.

fs4.L1.L2.segs <- h2o.cbind(fs4C.class[,c(1,2)],  cov.assign.hex$predict, fs4C.class[,3:34])
#fs6.L1.L2.segs
colnames(fs4.L1.L2.segs)[ c(2,3)] <- c("L1_Segments", "L2_Segments")
fs4.L1.L2.segs[1:6, 1:5]
   Confidence L1_Segments L2_Segments Tech.Enthu. Ann_Prem_Hm_Insu
1 0.009426962        Seg2        Seg2 0.009739573      0.004403856
2 0.025949382        Seg2        Seg2 0.014750739      0.005865521
3 0.025238024        Seg2        Seg2 0.029873820      0.020744546
4 0.031883370        Seg2        Seg2 0.024817610      0.018098652
5 0.030433902        Seg2        Seg2 0.028479670      0.009048419
6 0.056125319        Seg2        Seg2 0.061331941      0.063795120

[6 rows x 5 columns] 

This table shows where the two models' segment assignments coincide and where they diverge.

print( h2o.table(fs4.L1.L2.segs$L1_Segments,  fs4.L1.L2.segs$L2_Segments), n=36L)
   L1_Segments L2_Segments Counts
1         Seg1        Seg1  20923
2         Seg1        Seg2   8626
3         Seg1        Seg3   6702
4         Seg1        Seg4   5318
5         Seg2        Seg1   3995
6         Seg2        Seg2  51561
7         Seg2        Seg3   5551
8         Seg2        Seg4  15667
9         Seg3        Seg1   2383
10        Seg3        Seg2   3846
11        Seg3        Seg3  51096
12        Seg3        Seg4   5270
13        Seg4        Seg1   2575
14        Seg4        Seg2  16415
15        Seg4        Seg3   6171
16        Seg4        Seg4  46627

[16 rows x 3 columns] 
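The diagonal of the cross-tabulation above holds the cases where the two models assign the same segment, so the overall agreement rate is the diagonal sum over the grand total. Recomputing it from the counts printed above:

```r
# Counts copied from the L1 x L2 cross-tabulation above (rows = L1, cols = L2)
agree_tab <- matrix(c(20923,  8626,  6702,  5318,
                       3995, 51561,  5551, 15667,
                       2383,  3846, 51096,  5270,
                       2575, 16415,  6171, 46627),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("Seg", 1:4), paste0("Seg", 1:4)))

agreement <- sum(diag(agree_tab)) / sum(agree_tab)
round(agreement, 3)   # 0.673: the models agree on about two-thirds of cases
```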
h2o.exportFile(fs4.L1.L2.segs, "C:/Jaspreet Marketing Analytics Project/FS.4Seg.L1.L2_16dec20.csv") # save the results

6.18 Visualizing & Summarizing customers in segments

6.18.1 Table of Attribute Attitudes and Segments

The table below lists the means for each attitudinal variable for those in each of the 4 Level 1 Segments.

Top 3 most important variables in each segment:

Segment 1 Top 3:

  • Trust Financial/Banking Applications
  • Technology Enthusiast Per Household
  • Consent in sharing Mobile Information per HouseHold

Segment 2 Top 3:

  • Trust Financial/Banking Applications
  • Annual Premiums Paid for Home Insurance per HouseHold
  • Technology Enthusiast People Per Household

Segment 3 Top 3:

  • Trust Financial/Banking Applications
  • Technology Enthusiast Per Household
  • Smart Thermostat Owners per HouseHold & Confidence in Big Business per HouseHold

Segment 4 Top 3:

  • Trust Financial/Banking Applications
  • Annual Premiums Paid for Home Insurance Per HouseHold
  • Smart Home Assistant Owners per HouseHold
library("dplyr")
library("kableExtra")
Customer_Data <- read.csv("C:/Jaspreet Marketing Analytics Project/FS.4Seg.L1.L2_16dec20.csv")
library(psych)
library(data.table)

Customer_Data %>% 
  mutate(Group = as.factor(L1_Segments)) %>%
  group_by(Group) %>%
  summarize(Avg_er = round(mean(Confidence),2), 
            Avg_ps = round(mean(Tech.Enthu.),2), 
            Avg_fo = round(mean(Ann_Prem_Hm_Insu),2),
            Avg_tp = round(mean(Mob_Mark_cons),2),
            Avg_sc = round(mean(Mob_Inf_sec_cons),2),
            Avg_sr = round(mean(X..Smart_purch),2),
            Avg_os = round(mean(X.Smt_Therm),2),
            Avg_ca = round(mean(X.Smt_Hm_Secu),2),
            Avg_c = round(mean(X.Smt_Hm_Ass),2),
            Avg_f = round(mean(Smt_Hm_Dev),2),
            Avg_h = round(mean(X.Smat_Therm.Pln),2),
            Avg_i = round(mean(X.Smt_Hm_Sec.Pln),2),
            Avg_j = round(mean(X.Smt_Hm_Asst.Pln),2),
            Avg_k = round(mean(X.Smt_Hm_Dev.Pln),2),
            Avg_l = round(mean(Fin_App_Trst),2),
            Count_of_Members = n()
  ) %>%
  arrange(Group) %>% 
  transpose() -> cd

colnames(cd) <-  cd[1,] 
cd <- cd[-1,]
cd$order <- 1:nrow(cd)
cd$order
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
rownames(cd)[1:15] <- colnames(Customer_Data[c(1,4:17)])
rownames(cd)[16] <- c("Segment_Size")
# cd$variable <- rownames(cd)
cd$variable <- c("Confidence in Big Business per HouseHold- RBC",
                 "Technology Enthusiast People per HouseHold",
                 "Annual Premiums Paid for Home Insurance per HouseHold",
                 "Consent in Mobile Marketing per HouseHold",
                 "Consent in sharing Mobile Information per HouseHold",
                 "Per HouseHold people purchasing through smartphones",
                 "Smart Thermostat Owners per HouseHold",
                 "Smart Home Security Owners per HouseHold",
                 "Smart Home Assistant Owners per HouseHold",
                 "Smart Home Devices Owners per HouseHold",
                 "People planning to buy Smart Thermostat per HouseHold",
                 "People planning to buy Smart Home Security devices per HouseHold",
                 "People planning to buy Smart Home Assistant per HouseHold",
                 "People planning to buy Smart Home Devices per HouseHold",
                 "People having Financial Application Trust per HouseHold",
                 "Segment Size")
cd <- cd[, c(6, 1:5)]
cd[,2:5] <- lapply(cd[,2:5], function(x) as.numeric(as.character(x)))
cd[1:15, 1:5] %>% 
  arrange(variable) %>% 
  mutate_if(is.numeric, function(x) {
    cell_spec(x, bold = T, 
              color = spec_color(x, end = 0.9),
              font_size = spec_font_size(x))
  }) %>%
  kable(escape = F, align = "c") %>%
  kable_styling(c("striped", "condensed"), full_width = T, position = "left")
variable Seg1 Seg2 Seg3 Seg4
Annual Premiums Paid for Home Insurance per HouseHold 0.16 0.22 0.18 0.29
Confidence in Big Business per HouseHold- RBC 0.2 0.2 0.27 0.23
Consent in Mobile Marketing per HouseHold 0.14 0.12 0.15 0.13
Consent in sharing Mobile Information per HouseHold 0.21 0.16 0.19 0.19
People having Financial Application Trust per HouseHold 0.46 0.39 0.43 0.47
People planning to buy Smart Home Assistant per HouseHold 0.08 0.06 0.11 0.08
People planning to buy Smart Home Devices per HouseHold 0.1 0.08 0.13 0.1
People planning to buy Smart Home Security devices per HouseHold 0.1 0.09 0.16 0.11
People planning to buy Smart Thermostat per HouseHold 0.09 0.08 0.11 0.1
Per HouseHold people purchasing through smartphones 0.19 0.11 0.18 0.13
Smart Home Assistant Owners per HouseHold 0.19 0.17 0.26 0.21
Smart Home Devices Owners per HouseHold 0.11 0.09 0.16 0.12
Smart Home Security Owners per HouseHold 0.08 0.06 0.13 0.07
Smart Thermostat Owners per HouseHold 0.15 0.15 0.27 0.19
Technology Enthusiast People per HouseHold 0.26 0.18 0.25 0.21

Although the L2 segmentation wasn't that impressive, we went ahead and examined the segments at a finer granularity to see whether we could pick up anything interesting that would add value.

6.18.2 Correspondence Analysis of Attributes and Segments

library(FactoMineR)
cd.m <- cd[1:15, 2:5]
cd.m
                  Seg1 Seg2 Seg3 Seg4
Confidence        0.20 0.20 0.27 0.23
Tech.Enthu.       0.26 0.18 0.25 0.21
Ann_Prem_Hm_Insu  0.16 0.22 0.18 0.29
Mob_Mark_cons     0.14 0.12 0.15 0.13
Mob_Inf_sec_cons  0.21 0.16 0.19 0.19
X..Smart_purch    0.19 0.11 0.18 0.13
X.Smt_Therm       0.15 0.15 0.27 0.19
X.Smt_Hm_Secu     0.08 0.06 0.13 0.07
X.Smt_Hm_Ass      0.19 0.17 0.26 0.21
Smt_Hm_Dev        0.11 0.09 0.16 0.12
X.Smat_Therm.Pln  0.09 0.08 0.11 0.10
X.Smt_Hm_Sec.Pln  0.10 0.09 0.16 0.11
X.Smt_Hm_Asst.Pln 0.08 0.06 0.11 0.08
X.Smt_Hm_Dev.Pln  0.10 0.08 0.13 0.10
Fin_App_Trst      0.46 0.39 0.43 0.47
c <- CA(cd.m, graph=FALSE)
plot(c, title="Correspondence Analysis of Attributes and Segments", col.main="blue" )
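Under the hood, correspondence analysis decomposes the table's standardized chi-square residuals with a singular value decomposition; the squared singular values are the principal inertias shown on the plot axes. A minimal sketch of that computation, using the first three rows of cd.m copied above so the block is self-contained:

```r
# What CA computes: SVD of the standardized residual matrix
tab <- matrix(c(0.20, 0.20, 0.27, 0.23,    # Confidence
                0.26, 0.18, 0.25, 0.21,    # Tech.Enthu.
                0.16, 0.22, 0.18, 0.29),   # Ann_Prem_Hm_Insu
              nrow = 3, byrow = TRUE)

P  <- tab / sum(tab)                  # correspondence matrix
r  <- rowSums(P)                      # row masses
cc <- colSums(P)                      # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
sv <- svd(S)
sv$d^2                                # principal inertias of the axes
```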

6.18.3 Interpreting the Segments Using Demographic Data

The tables that follow provide the row and column percentages for each level of each demographic variable across the 4 Level 1 segments. While prediction using the demographic variables has been shown to be poor, they may still offer insights that help describe the segments.

The Chi-square statistics beneath each table provide some indication of whether the table shows any significant relationship between the variables and the segments.

The blue/purple numbers represent the across-segment percentages; the green numbers represent the within-segment percentages.

6.18.4 Segments by Total Population Age

  • Segment 1: 96% of them are within the age bracket of 26-64.
  • Segment 2: 58% of them are within the age bracket of 40-64.
  • Segment 3: 52% of them are within the age bracket of 40-64.
  • Segment 4: Skewing older: 78.3% are within the age bracket of 40-64.
library(sjPlot)
sjt.xtab(Customer_Data$Pop_Age, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Pop_Age L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-25 Years 864
18.4 %
2.1 %
2827
60.2 %
3.7 %
773
16.5 %
1.2 %
234
5 %
0.3 %
4698
100 %
1.9 %
26-50 Years 36608
18 %
88.1 %
62481
30.8 %
81.4 %
55280
27.2 %
88.3 %
48713
24 %
67.9 %
203082
100 %
80.4 %
51-75 Years 4084
9.2 %
9.8 %
11165
25.2 %
14.5 %
6471
14.6 %
10.3 %
22573
51 %
31.4 %
44293
100 %
17.5 %
More than 76 Years 13
2 %
0 %
301
46.1 %
0.4 %
71
10.9 %
0.1 %
268
41 %
0.4 %
653
100 %
0.3 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=16429.159 · df=9 · Cramer’s V=0.147 · p=0.000
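The Cramér's V printed under the table can be reproduced from the chi-square statistic, the sample size, and the smaller table dimension, via V = sqrt(chi2 / (n * (min(rows, cols) - 1))):

```r
# Reproducing the Cramér's V reported beneath the age-by-segment table
chi2 <- 16429.159            # chi-square statistic from the table footer
n    <- 252726               # total respondents
k    <- min(4, 4)            # smaller of (number of rows, number of columns)

cramers_v <- sqrt(chi2 / (n * (k - 1)))
round(cramers_v, 3)          # 0.147, matching the table footer
```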

6.18.5 Segments by Male Population Age

sjt.xtab(Customer_Data$M_Pop_Age, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
M_Pop_Age L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-25 Years 1472
13.7 %
3.5 %
5272
49 %
6.9 %
1803
16.8 %
2.9 %
2207
20.5 %
3.1 %
10754
100 %
4.3 %
26-50 Years 36231
18.2 %
87.2 %
59873
30 %
78 %
54143
27.1 %
86.5 %
49201
24.7 %
68.5 %
199448
100 %
78.9 %
51-75 Years 3837
9.2 %
9.2 %
11346
27.2 %
14.8 %
6543
15.7 %
10.5 %
20037
48 %
27.9 %
41763
100 %
16.5 %
More than 76 Years 29
3.8 %
0.1 %
283
37.2 %
0.4 %
106
13.9 %
0.2 %
343
45.1 %
0.5 %
761
100 %
0.3 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=12301.018 · df=9 · Cramer’s V=0.127 · p=0.000

6.18.6 Segments by Female Population Age

#__THE LEVELS WILL NOT PRINT IN ORDER UNLESS WE PHYSICALLY ORDER THEM
library(forcats)
Customer_Data$F_Pop_Age <- fct_relevel(Customer_Data$F_Pop_Age)
levels(Customer_Data$F_Pop_Age)
[1] "0-25 Years"         "26-50 Years"        "51-75 Years"       
[4] "More than 76 Years"
sjt.xtab(Customer_Data$F_Pop_Age, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
F_Pop_Age L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-25 Years 5284
17.5 %
12.7 %
10925
36.2 %
14.2 %
5898
19.6 %
9.4 %
8057
26.7 %
11.2 %
30164
100 %
11.9 %
26-50 Years 29364
18 %
70.6 %
48605
29.8 %
63.3 %
47324
29 %
75.6 %
37741
23.1 %
52.6 %
163034
100 %
64.5 %
51-75 Years 6199
11.5 %
14.9 %
15377
28.4 %
20 %
8427
15.6 %
13.5 %
24110
44.6 %
33.6 %
54113
100 %
21.4 %
More than 76 Years 722
13.3 %
1.7 %
1867
34.5 %
2.4 %
946
17.5 %
1.5 %
1880
34.7 %
2.6 %
5415
100 %
2.1 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=11718.907 · df=9 · Cramer’s V=0.124 · p=0.000

6.18.7 Segments by Private Household

Overall, it appears that the majority of people in each segment live in 3-4 person households.

  • Segment 1: 60% of this segment lives in 3-4 person households.
  • Segment 2: 80% of this segment lives in 3-4 person households.
  • Segment 3: 72.3% of this segment lives in 3-4 person households.
  • Segment 4: 73.1% of this segment lives in 3-4 person households. Roughly half of the 0-2 person households who were surveyed fall into Segment 4.
#__THE LEVELS WILL NOT PRINT IN ORDER UNLESS WE PHYSICALLY ORDER THEM
library(forcats)
Customer_Data$Pvt_HHld <- fct_relevel(Customer_Data$Pvt_HHld)
levels(Customer_Data$Pvt_HHld)
[1] "0-2 Persons Private Households" "3-4 Persons Private Households"
[3] "5-6 Persons Private Households" "7-8 Persons Private Households"
[5] "8+ Persons Private Households" 
sjt.xtab(Customer_Data$Pvt_HHld, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Pvt_HHld L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-2 Persons Private
Households
10220
26.1 %
24.6 %
9157
23.4 %
11.9 %
542
1.4 %
0.9 %
19257
49.2 %
26.8 %
39176
100 %
15.5 %
3-4 Persons Private
Households
24947
13.5 %
60 %
61598
33.4 %
80.2 %
45277
24.6 %
72.3 %
52490
28.5 %
73.1 %
184312
100 %
72.9 %
5-6 Persons Private
Households
4461
20.6 %
10.7 %
4698
21.7 %
6.1 %
12416
57.4 %
19.8 %
41
0.2 %
0.1 %
21616
100 %
8.6 %
7-8 Persons Private
Households
1079
25.5 %
2.6 %
1039
24.6 %
1.4 %
2113
49.9 %
3.4 %
0
0 %
0 %
4231
100 %
1.7 %
8+ Persons Private
Households
862
25.4 %
2.1 %
282
8.3 %
0.4 %
2247
66.3 %
3.6 %
0
0 %
0 %
3391
100 %
1.3 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=41657.517 · df=12 · Cramer’s V=0.234 · p=0.000

6.18.8 Segments by Family Household

Unfortunately, this variable did not yield results that can be meaningfully interpreted: it is recorded on a per-postal-code basis, and dividing by the total population produced values less than 1.

#__THE LEVELS WILL NOT PRINT IN ORDER UNLESS WE PHYSICALLY ORDER THEM
library(forcats)
Customer_Data$Fam_Hhld <- fct_relevel(Customer_Data$Fam_Hhld)
levels(Customer_Data$Fam_Hhld)
[1] "0-5 Family Households"   "11-15 Family Households"
[3] "16-20 Family Households" "21-25 Family Households"
[5] "26-30 Family Households" "31-35 Family Households"
[7] "36-40 Family Households" "40+ Family Households"  
[9] "6-10 Family Households" 
sjt.xtab(Customer_Data$Fam_Hhld, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Fam_Hhld L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-5 Family
Households
19649
19.9 %
47.3 %
30264
30.7 %
39.4 %
19895
20.2 %
31.8 %
28836
29.2 %
40.2 %
98644
100 %
39 %
11-15 Family
Households
4200
13.9 %
10.1 %
9030
29.8 %
11.8 %
7997
26.4 %
12.8 %
9031
29.8 %
12.6 %
30258
100 %
12 %
16-20 Family
Households
2418
15.7 %
5.8 %
4525
29.4 %
5.9 %
4194
27.2 %
6.7 %
4272
27.7 %
6 %
15409
100 %
6.1 %
21-25 Family
Households
1559
13 %
3.8 %
3496
29.2 %
4.6 %
3766
31.5 %
6 %
3132
26.2 %
4.4 %
11953
100 %
4.7 %
26-30 Family
Households
1127
10.3 %
2.7 %
2990
27.3 %
3.9 %
3981
36.4 %
6.4 %
2848
26 %
4 %
10946
100 %
4.3 %
31-35 Family
Households
764
8 %
1.8 %
2101
22.1 %
2.7 %
4705
49.4 %
7.5 %
1950
20.5 %
2.7 %
9520
100 %
3.8 %
36-40 Family
Households
543
10.2 %
1.3 %
941
17.7 %
1.2 %
3131
58.8 %
5 %
707
13.3 %
1 %
5322
100 %
2.1 %
40+ Family
Households
2239
21 %
5.4 %
4235
39.7 %
5.5 %
1809
16.9 %
2.9 %
2395
22.4 %
3.3 %
10678
100 %
4.2 %
6-10 Family
Households
9070
15.1 %
21.8 %
19192
32 %
25 %
13117
21.9 %
21 %
18617
31 %
25.9 %
59996
100 %
23.7 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=10856.938 · df=24 · Cramer’s V=0.120 · p=0.000

6.18.9 Segments by Non-Family Household

Unfortunately, this variable did not yield results that can be meaningfully interpreted: it is recorded on a per-postal-code basis, and dividing by the total population produced values less than 1.

sjt.xtab(Customer_Data$NF_Hhld, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
NF_Hhld L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-1 Non-Family
Households
11028
12.3 %
26.5 %
24734
27.6 %
32.2 %
31633
35.3 %
50.5 %
22204
24.8 %
30.9 %
89599
100 %
35.5 %
2-3 Non-Family
Households
8666
11.6 %
20.8 %
22434
30 %
29.2 %
19544
26.2 %
31.2 %
24030
32.2 %
33.5 %
74674
100 %
29.5 %
4-5 Non-Family
Households
5804
16.3 %
14 %
11388
31.9 %
14.8 %
6783
19 %
10.8 %
11741
32.9 %
16.4 %
35716
100 %
14.1 %
6-7 Non-Family
Households
3866
22.3 %
9.3 %
5621
32.5 %
7.3 %
2348
13.6 %
3.8 %
5473
31.6 %
7.6 %
17308
100 %
6.8 %
9+ Non-Family
Households
12205
34.4 %
29.4 %
12597
35.6 %
16.4 %
2287
6.5 %
3.7 %
8340
23.5 %
11.6 %
35429
100 %
14 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=21567.022 · df=12 · Cramer’s V=0.169 · p=0.000

6.18.10 Segments by Average Child per census family household

Overall, roughly half of households across all segments have 2 children.

  • Segment 1: 93.9% of this segment has 0-2 children per household.
  • Segment 3: 75.2% of this segment has 2 children per household.
  • Segment 4: 70.7% of this segment has 0-1 children per household.
sjt.xtab(Customer_Data$Avg_Chld_PrCen_Fm_Hhld, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Avg_Chld_PrCen_Fm_Hhld L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-1 Children Per
Household
20601
18.1 %
49.6 %
31119
27.4 %
40.5 %
11154
9.8 %
17.8 %
50720
44.7 %
70.7 %
113594
100 %
44.9 %
2 Children Per
Household
18401
14.6 %
44.3 %
39694
31.5 %
51.7 %
47077
37.3 %
75.2 %
20897
16.6 %
29.1 %
126069
100 %
49.9 %
3 Children Per
Household
2217
19.9 %
5.3 %
5091
45.8 %
6.6 %
3644
32.8 %
5.8 %
166
1.5 %
0.2 %
11118
100 %
4.4 %
4 Children Per
Household
304
17.8 %
0.7 %
827
48.4 %
1.1 %
571
33.5 %
0.9 %
5
0.3 %
0 %
1707
100 %
0.7 %
5 or More Children
Per Household
46
19.3 %
0.1 %
43
18.1 %
0.1 %
149
62.6 %
0.2 %
0
0 %
0 %
238
100 %
0.1 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=40909.129 · df=12 · Cramer’s V=0.232 · p=0.000

6.18.11 Segments by Houses

  • Segment 1: Lowest likelihood of owning a house.
  • Segment 2: Second highest likelihood of owning a house.
  • Segment 4: Has the highest likelihood of owning a house.
library(forcats)
Customer_Data$Houses <- fct_relevel(as.factor(Customer_Data$Houses))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Houses, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Houses L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-25% Living in
Houses
29500
31.8 %
71 %
22999
24.8 %
30 %
31883
34.4 %
50.9 %
8303
9 %
11.6 %
92685
100 %
36.7 %
26-50% Living in
Houses
12007
7.7 %
28.9 %
53300
34.2 %
69.4 %
30708
19.7 %
49.1 %
59946
38.4 %
83.5 %
155961
100 %
61.7 %
51-75% Living in
Houses
60
1.6 %
0.1 %
468
12.6 %
0.6 %
4
0.1 %
0 %
3191
85.7 %
4.4 %
3723
100 %
1.5 %
More than 76%
Living in Houses
2
0.6 %
0 %
7
2 %
0 %
0
0 %
0 %
348
97.5 %
0.5 %
357
100 %
0.1 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=52175.616 · df=9 · Cramer’s V=0.262 · p=0.000

6.18.12 Segments by Apartments

  • Segment 1: Most likely to rent an apartment.
  • Segment 2: Extremely unlikely to rent an apartment.
  • Segment 3: Extremely unlikely to rent an apartment.
  • Segment 4: Extremely unlikely to rent an apartment.
library(forcats)
Customer_Data$Apt <- fct_relevel(as.factor(Customer_Data$Apt))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Apt, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Apt L1_Segments Total
Seg1 Seg2 Seg3 Seg4
0-25% Living in
Apartments
28628
12.5 %
68.9 %
71037
31.1 %
92.5 %
61947
27.1 %
99 %
66670
29.2 %
92.9 %
228282
100 %
90.3 %
26-50% Years Living
in Apartments
9468
56.9 %
22.8 %
4049
24.3 %
5.3 %
325
2 %
0.5 %
2797
16.8 %
3.9 %
16639
100 %
6.6 %
51-75% Living in
Apartments
3170
47.4 %
7.6 %
1527
22.8 %
2 %
261
3.9 %
0.4 %
1736
25.9 %
2.4 %
6694
100 %
2.6 %
More than 76%
Living in Apartments
303
27.3 %
0.7 %
161
14.5 %
0.2 %
62
5.6 %
0.1 %
585
52.7 %
0.8 %
1111
100 %
0.4 %
Total 41569
16.4 %
100 %
76774
30.4 %
100 %
62595
24.8 %
100 %
71788
28.4 %
100 %
252726
100 %
100 %
χ2=29540.281 · df=9 · Cramer’s V=0.197 · p=0.000

6.18.13 Segments by Condos

  • Segment 1: Very unlikely to own/rent a condo.
  • Segment 2: Very unlikely to own/rent a condo.
  • Segment 3: Very unlikely to own/rent a condo.
  • Segment 4: Very unlikely to own/rent a condo.
library(forcats)
Customer_Data$Condos <- fct_relevel(as.factor(Customer_Data$Condos))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Condos, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Condos (cell: n, row % / col %)    Seg1                Seg2                Seg3                Seg4                Total
0-25% Living in Condos             40549 (16.4/97.5)   75443 (30.5/98.3)   61905 (25.1/98.9)   69108 (28.0/96.3)   247005 (100/97.7)
26-50% Living in Condos            966 (21.5/2.3)      1263 (28.2/1.6)     418 (9.3/0.7)       1837 (41.0/2.6)     4484 (100/1.8)
51-75% Living in Condos            54 (4.8/0.1)        68 (6.1/0.1)        246 (22.1/0.4)      747 (67.0/1.0)      1115 (100/0.4)
More than 76% Living in Condos     0 (0.0/0.0)         0 (0.0/0.0)         26 (21.3/0.0)       96 (78.7/0.1)       122 (100/0.0)
Total                              41569 (16.4/100)    76774 (30.4/100)    62595 (24.8/100)    71788 (28.4/100)    252726 (100/100)
χ2=1847.712 · df=9 · Cramer’s V=0.049 · p=0.000

6.18.14 Segments by Household Income

When we normalized the data by the population of each postal code, the household-income crosstab did not produce a usable breakdown, so its output is omitted here (the code is retained below for reference).

#library(forcats)
#Customer_Data$Hhld.Income <- fct_relevel(as.factor(Customer_Data$Hhld.Income))

#sjt.xtab(Customer_Data$Hhld.Income, Customer_Data$L1_Segments,
 #        show.row.prc = TRUE, show.col.prc = TRUE)
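The normalization described above can be sketched as follows. This is a hypothetical reconstruction, not the code actually run: the column names `Hhld.Income.Count` (raw household-income count) and `Postal.Population` (population of the postal code) are assumptions, not names confirmed by the data dictionary.

```r
# Hypothetical sketch: divide each postal code's household-income count by
# its population, then bin the resulting share into the quartile labels
# used throughout this section. Column names are assumptions.
library(dplyr)
Customer_Data <- Customer_Data %>%
  mutate(
    Hhld.Income.Share = Hhld.Income.Count / Postal.Population,
    Hhld.Income = cut(Hhld.Income.Share,
                      breaks = c(0, 0.25, 0.50, 0.75, 1),
                      labels = c("0-25%", "26-50%", "51-75%", "More than 76%"),
                      include.lowest = TRUE)
  )
```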

6.18.15 Segments by High School Certificate Or Equivalent

Most people in all segments are likely to have an education level higher than solely a high school diploma.

library(forcats)
Customer_Data$High.School <- fct_relevel(as.factor(Customer_Data$High.School))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$High.School, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
High.School (cell: n, row % / col %)   Seg1                Seg2                Seg3                Seg4                Total
0-25% completed High School            23471 (15.5/56.5)   42464 (28.1/55.3)   45233 (29.9/72.3)   40160 (26.5/55.9)   151328 (100/59.9)
26-50% completed High School           17845 (17.7/42.9)   34246 (34.0/44.6)   17251 (17.1/27.6)   31398 (31.2/43.7)   100740 (100/39.9)
51-75% completed High School           224 (36.9/0.5)      63 (10.4/0.1)       107 (17.6/0.2)      213 (35.1/0.3)      607 (100/0.2)
More than 76% completed High School    29 (56.9/0.1)       1 (2.0/0.0)         4 (7.8/0.0)         17 (33.3/0.0)       51 (100/0.0)
Total                                  41569 (16.4/100)    76774 (30.4/100)    62595 (24.8/100)    71788 (28.4/100)    252726 (100/100)
χ2=5643.164 · df=9 · Cramer’s V=0.086 · p=0.000

6.18.16 Segments by College, CEGEP Or Other Non-University Certificate Or Diploma

Most people in all segments are likely to have an education level higher than solely a college/CEGEP diploma.

library(forcats)
Customer_Data$Clg_CEGEP_Non_Uni_Dip <- fct_relevel(as.factor(Customer_Data$Clg_CEGEP_Non_Uni_Dip))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Clg_CEGEP_Non_Uni_Dip, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Clg_CEGEP_Non_Uni_Dip (cell: n, row % / col %)   Seg1                Seg2                Seg3                Seg4                Total
0-25% completed Non-University Diploma           36392 (17.2/87.5)   62219 (29.3/81.0)   58918 (27.8/94.1)   54637 (25.8/76.1)   212166 (100/84.0)
26-50% completed Non-University Diploma          5166 (12.8/12.4)    14543 (35.9/18.9)   3658 (9.0/5.8)      17091 (42.2/23.8)   40458 (100/16.0)
51-75% completed Non-University Diploma          11 (13.1/0.0)       12 (14.3/0.0)       17 (20.2/0.0)       44 (52.4/0.1)       84 (100/0.0)
More than 76% completed Non-University Diploma   0 (0.0/0.0)         0 (0.0/0.0)         2 (11.1/0.0)        16 (88.9/0.0)       18 (100/0.0)
Total                                            41569 (16.4/100)    76774 (30.4/100)    62595 (24.8/100)    71788 (28.4/100)    252726 (100/100)
χ2=9010.909 · df=9 · Cramer’s V=0.109 · Fisher’s p=0.000

6.18.17 Segments by University Certificate Or Diploma Below Bachelor

  • Segment 2: The least likely to have a university degree as their highest level of education.
  • Segment 3: Is the most likely to have a university degree as their highest level of education. This is the most educated group.
library(forcats)
Customer_Data$Uni_Dip_Blw_Bach <- fct_relevel(as.factor(Customer_Data$Uni_Dip_Blw_Bach))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Uni_Dip_Blw_Bach, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Uni_Dip_Blw_Bach (cell: n, row % / col %)   Seg1                Seg2                Seg3                Seg4                Total
0-25% completed University Degree           22845 (15.2/55.0)   62315 (41.4/81.2)   21317 (14.2/34.1)   44108 (29.3/61.4)   150585 (100/59.6)
26-50% completed University Degree          14368 (16.4/34.6)   13575 (15.5/17.7)   35135 (40.2/56.1)   24343 (27.8/33.9)   87421 (100/34.6)
51-75% completed University Degree          4268 (30.3/10.3)    852 (6.1/1.1)       5834 (41.4/9.3)     3128 (22.2/4.4)     14082 (100/5.6)
More than 76% completed University Degree   88 (13.8/0.2)       32 (5.0/0.0)        309 (48.4/0.5)      209 (32.8/0.3)      638 (100/0.3)
Total                                       41569 (16.4/100)    76774 (30.4/100)    62595 (24.8/100)    71788 (28.4/100)    252726 (100/100)
χ2=34233.449 · df=9 · Cramer’s V=0.212 · p=0.000

6.18.18 Segments by Employed

  • Segments 3 and 4: Most likely to be employed.
library(forcats)
Customer_Data$Empl <- fct_relevel(as.factor(Customer_Data$Empl))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Empl, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Empl (cell: n, row % / col %)   Seg1                Seg2                Seg3                Seg4                Total
0-25% Employed                  1449 (13.3/3.5)     6079 (55.9/7.9)     786 (7.2/1.3)       2568 (23.6/3.6)     10882 (100/4.3)
26-50% Employed                 15319 (14.6/36.9)   39809 (38.1/51.9)   23601 (22.6/37.7)   25884 (24.7/36.1)   104613 (100/41.4)
51-75% Employed                 22440 (17.7/54.0)   30141 (23.8/39.3)   35055 (27.6/56.0)   39187 (30.9/54.6)   126823 (100/50.2)
More than 76% Employed          2361 (22.7/5.7)     745 (7.2/1.0)       3153 (30.3/5.0)     4149 (39.9/5.8)     10408 (100/4.1)
Total                           41569 (16.4/100)    76774 (30.4/100)    62595 (24.8/100)    71788 (28.4/100)    252726 (100/100)
χ2=12116.042 · df=9 · Cramer’s V=0.126 · p=0.000

6.18.19 Segments by Unemployed

The data from this chart did not yield optimal results and was therefore not analyzed.

#library(forcats)
#Customer_Data$UnEmp <- fct_relevel(as.factor(Customer_Data$UnEmp))
# levels(Customer_Data$Income)
#sjt.xtab(Customer_Data$UnEmp, Customer_Data$L1_Segments,
         #show.row.prc = TRUE, show.col.prc = TRUE)

6.18.20 Segments by Total Immigrants

  • Segment 3: The majority are likely to be immigrants.
library(forcats)
Customer_Data$Tot_Mig <- fct_relevel(as.factor(Customer_Data$Tot_Mig))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Tot_Mig, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Tot_Mig (cell: n, row % / col %)   Seg1                Seg2                Seg3                Seg4                Total
0-25% Immigrants                   25142 (16.3/60.5)   61611 (40.0/80.2)   12607 (8.2/20.1)    54480 (35.4/75.9)   153840 (100/60.9)
26-50% Immigrants                  12990 (19.4/31.2)   13363 (20.0/17.4)   25005 (37.4/39.9)   15491 (23.2/21.6)   66849 (100/26.5)
51-75% Immigrants                  3330 (10.7/8.0)     1724 (5.5/2.2)      24355 (78.2/38.9)   1750 (5.6/2.4)      31159 (100/12.3)
More than 76% Immigrants           107 (12.2/0.3)      76 (8.7/0.1)        628 (71.5/1.0)      67 (7.6/0.1)        878 (100/0.3)
Total                              41569 (16.4/100)    76774 (30.4/100)    62595 (24.8/100)    71788 (28.4/100)    252726 (100/100)
χ2=81727.778 · df=9 · Cramer’s V=0.328 · p=0.000

6.19 Finalized Segments

Segment 1: Open-Minded Renters

These individuals (26-64) are the most tech-enthusiastic of all renters, though their smart-technology adoption is only moderate: they own smartphones, smart thermostats, and smart home assistants. Although they are excited about technology, they are not likely to purchase smart home technologies in the future, perhaps because there is not enough incentive for them to do so (e.g., lower insurance premiums or a simpler daily routine). There is also a sizeable opportunity to educate these individuals on the ease of use of these technologies: they are middle-aged and lead busy lives, with most raising one to two children in an apartment. While they don't plan on owning smart home technology, they are considered open-minded because they are the segment most willing to share their mobile information, are enthusiastic about technology in general, and trust banking/financial apps; they are therefore likely to convert. Overall, they are technologically enabled but need an extra push to purchase more smart home technology. Location: scattered all across Ontario

Segment 2: Low-tech Homeowners

This segment owns homes, falls in the 40-64 age bracket, and is not only slow to adopt smart home technology but does not plan on purchasing it in the future. They live in 3-4 person households, and given that they pay medium-to-high home insurance premiums, there is an opportunity to educate them on the impact smart home technology could have on those premiums. They are not tech enthusiasts, but they currently own smartphones and smart home assistants. Interestingly, they are most likely to consent to mobile information sharing. Location: Northern and Southern Ontario

Segment 3: Migrant Technophiles

These individuals are very excited by technology. They own the widest range of smart technologies, including smartphones, smart thermostats, smart home security systems, smart home assistants, and smart home appliances and lighting. They are also likely to purchase more smart home technology in the future (specifically smart home assistants). Most have two children per household, are employed, and hold a university degree as their highest level of education (the highest among the four segments). Notably, they have the highest combined household income, roughly $148,000. They are also the most confident in big businesses as well as banking/financial applications, which is a plus for RBCI. Location: Southern Ontario

Segment 4: Relaxed Retirees

This segment is 65+ and pays high insurance premiums, but is slow to adopt smart home technology. Most have a smart thermostat and a smart home assistant but don't plan on purchasing more smart home technology in the future. These individuals (or their children) need a lot of education in order to become more technologically enabled. Because the Relaxed Retirees pay the highest insurance premiums, messaging should emphasize opportunities to decrease premiums through the adoption of certain smart home technologies. They are also confident in big businesses, which is a plus for RBCI. Location: in or surrounding cities (non-rural)

7 Conclusions and Marketing Messaging Considerations

Our messaging to the 4 segments will depend on the key benefits derived from smart home technologies, which we have narrowed down to the following major benefits:

  • Increased security: Smart home technologies can reduce the risk of fire or theft through built-in alert systems. This can be particularly valuable to people for whom security is the biggest driver.
  • Reduced costs: Installing smart home devices may reduce insurance premiums, similar to how installing similar devices in vehicles can reduce car insurance premiums. This aspect of smart home technology will be particularly appealing to people who have low income, or are paying high homeowners insurance premiums.
  • Increased convenience: An ancillary benefit of installing these technologies will be the ability to monitor and control home devices remotely.
  • Improved quality-of-life: For people living alone, especially seniors, smart devices can literally save lives. A smart speaker might be able to bring emergency services to the side of a person who just fell down. Improved connectivity will also help with mental health; people who can’t connect with their loved ones, especially under the current circumstances, might hasten adoption of smart home technologies.

Factors influencing the purchase of smart devices

  • Peer pressure: We may be on the edge of a cultural shift similar to the one in the early 2000s, when cell phones went from being a nice-to-have to a must-have. While the immediate pull will likely be slower due to the lower visibility of these items, they may eventually become hallmarks of social progress.
  • Concerns around Data Privacy: With more and more companies using and selling data, people may be hesitant to trust multi-billion dollar conglomerates with their most sensitive data, including voice and behavioural data.

In conclusion, we have identified four segments: Open-Minded Renters, Low-Tech Homeowners, Migrant Technophiles, and Relaxed Retirees. All four should be considered for targeting; marketing considerations for each are elaborated below.

Considering the messaging and value propositions mentioned above, this is how the marketing strategy, and specifically the messaging, would be communicated to each segment:

Segment 1: Open-Minded Renters

  • Marketing Messaging: Reduced costs (premiums), increased convenience
  • Marketing Channel: Mobile marketing

Segment 2: Low-tech Homeowners

  • Marketing Messaging: Reduced costs (premiums), increased convenience
  • Marketing Channel: Traditional Marketing (radio, television, telemarketing)

Segment 3: Migrant Technophiles

  • Marketing Messaging: Increased security, improved QoL
  • Marketing Channel: Mobile marketing

Segment 4: Relaxed Retirees

  • Marketing Messaging: Reduced costs (premiums), improved QoL
  • Marketing Channel: Traditional Marketing (radio, television, telemarketing)

Some suggestions for partnerships that could result in offerings to clients, such as discounts on premiums, free installation, or discounts on the smart tech products themselves, include:

  1. Google Nest: smart thermostats and smoke & CO detectors. Benefits: mitigates the risk of fire damage and of damage due to extreme home temperatures

  2. Amazon Ring: smart doorbells and a smart flying security drone. Benefits: mitigates the risk of home damage done by intruders

  3. Amazon Echo: smart speakers and smart plugs. Benefits: mitigates the risk of small household appliances staying on too long and causing home damage

  4. General Electric: smart appliances. Benefits: mitigates the risk of damage due to, for example, a stove/oven being left on all day (it can be shut off via phone)